Coupled Wave Theory for Thick
Hologram Gratings


By HERWIG KOGELNIK 
(Manuscript received May 23, 1969) 


A coupled wave analysis is given of the Bragg diffraction of light by
thick hologram gratings, which is analogous to Phariseau’s treatment of
acoustic gratings and to the ‘‘dynamical’’ theory of X-ray diffraction. The 
theory remains valid for large diffraction efficiencies where the incident 
wave is strongly depleted. It is applied to transmission holograms and to 
reflection holograms. Spatial modulations of both the refractive index and 
the absorption constant are allowed for. The effects of loss in the grating and 
of slanted fringes are also considered. Algebraic formulas and their nu- 
merical evaluations are given for the diffraction efficiencies and the angular 
and wavelength sensitivities of the various hologram types. 


I. INTRODUCTION 


Holographic recording in thick media ("volume recording") is of
particular interest for high-capacity information storage, for color
holography, and for efficient white-light display of holograms. The
high efficiency of light conversion which is attainable with thick
dielectric holograms is also important for microimaging, and it may make
it practical to use holographic optical components (for example, gratings
or fly's eye lenses) in a variety of optical systems.

In thick holograms it is light diffraction at or near the Bragg angle
which leads to efficient wavefront reconstruction. This is true for both
transmission and reflection holograms, and both types are considered
in this paper. The (volume) record of the holographic interference
pattern (fringe pattern) usually takes the form of a spatial modulation
of the absorption constant or the refractive index of the medium, or
both. Modulations of the absorption constant are produced in
conventional photographic emulsions and in photochromics, while newer
materials, like dichromated gelatin, lithium niobate, or photopolymer
materials, yield modulations of the refractive index.

This paper considers the properties of all these types of thick (or
"deep") holograms. Of particular interest is their efficiency of
converting light into the useful reconstructed wave (diffraction efficiency) and
the angular dependence of this diffraction efficiency as the incident 
light deviates from the Bragg angle. We are also interested in the wave- 
length dependence and in the way the diffraction properties are changed 
in the presence of loss or a slant of the fringe pattern with respect to 
the surface of the recording medium. 

Leith and his associates, and Gabor and Stroke have already
considered some of the properties of thick holograms, in particular the
angular and the wavelength dependence of the diffracted light. Their
theories are essentially linear or perturbational theories which use the
Kirchhoff integral or the first Born approximation with the basic
assumption that the incident light wave is not disturbed by the
diffraction process. Their results are valid as long as this assumption
holds. For high diffraction efficiencies (like 90 percent) the incident
wave is strongly depleted and another approach is called for. One such
approach is to use electronic computers to solve the relevant
complicated electromagnetic problem accurately. Results of such computations
are available for special cases. Klein, Tipnis, and Hiedemann have
computed data for light diffraction by ultrasonic waves, and Burckhardt
has reported results for dielectric hologram gratings. The method
of Bhatia and Noble is another approach, in which integral equations
are employed to analyze acoustic diffraction of light.

Yet another approach is the use of a coupled wave theory, which is
the subject of this paper. Such a theory can predict the maximum
possible efficiencies of the various hologram types (results which one
cannot hope to obtain from linear theories), and the angular and
wavelength dependence at high diffraction efficiencies. Following Phariseau,
coupled wave theories have been successfully used in the treatment of
light diffraction by acoustic waves and by electrooptic gratings,
where diffraction processes very similar to those in holography are at work.
Closely related to the diffraction in thick holograms are also the
diffraction of electrons in lattices and the diffraction of X-rays in crystals.
The dynamical theory of X-ray diffraction is also a theory of coupled
waves and its application to holography has already been suggested.

We have earlier reported some of the results and an outline of the
coupled wave theory for hologram gratings. Here we propose to
give further results and a more detailed account of the basic assumptions 
and the analysis. We give analytic formulas for the various hologram 
types as well as numerical evaluations which include results on the 
angular sensitivities and the influence of loss and slant. 

For simplicity the analysis is restricted to the holographic record
of sinusoidal fringe patterns which we call hologram gratings. To some
degree a more complicated hologram can be regarded as a multiplicity 
of such hologram gratings. 


II. COUPLED WAVE ANALYSIS 


2.1 Derivation of the Coupled Wave Equations 


The coupled wave theory assumes monochromatic light incident on 
the hologram grating at or near the Bragg angle and polarized per- 
pendicular to the plane of incidence.* Only two significant light waves 
are assumed to be present in the grating: the incoming "reference"
wave R and the outgoing "signal" wave S. Only these two waves
obey the Bragg condition at least approximately; the other diffraction
orders violate the Bragg condition strongly and are neglected. They
should be of little influence on the energy interchange between S 
and R. The last assumption limits the validity of the coupled wave 
theory to thick hologram gratings. Section 6 gives a more detailed 
discussion of this limitation. 

Figure 1 shows the model of a hologram grating which is used for 
our analysis. The z-axis is chosen perpendicular to the surfaces of the 
medium, the x-axis in the plane of incidence and parallel to the medium 
boundaries and the y-axis perpendicular to the paper. The fringe 
planes are oriented perpendicular to the plane of incidence and slanted 
with respect to the medium boundaries at an angle $\phi$. The fringes
are shown dotted. The grating vector $\mathbf{K}$ is oriented perpendicular to
the fringe planes and is of length $K = 2\pi/\Lambda$, where $\Lambda$ is the period
of the grating. The same average dielectric constant is assumed for
the region inside and outside the grating boundaries. The angle of
incidence measured in the medium is $\theta$.


* A generalization to parallel polarization is given in the appendix.

Fig. 1 - Model of a thick hologram grating with slanted fringes. The spatial
modulation of $n$ or $\alpha$ is indicated by the dotted pattern. The grating parameters are:
$\theta$ (angle of incidence in the medium), $\mathbf{K}$ (grating vector, perpendicular to the fringe
planes), $\Lambda$ (grating period), $\phi$ (slant angle), and $d$ (grating thickness).


Wave propagation in the grating is described by the scalar wave
equation

$$
\nabla^2 E + k^2 E = 0, \tag{1}
$$

where $E(x, z)$ is the complex amplitude of the y-component of the
electric field, which is assumed to be independent of y and to oscillate
with an angular frequency $\omega$. The propagation constant $k(x, z)$ is
spatially modulated and related to the relative dielectric constant
$\epsilon(x, z)$ and the conductivity $\sigma(x, z)$ of the medium by

$$
k^2 = \frac{\omega^2}{c^2}\,\epsilon - j\omega\mu\sigma, \tag{2}
$$

where c is the light velocity in free space and $\mu$ is the permeability
of the medium which we assume to be equal to that of free space.
In our model the constants of the medium are independent of y. The
fringes of the hologram grating are represented by a spatial modulation
of $\epsilon$ or $\sigma$:

$$
\begin{aligned}
\epsilon &= \epsilon_0 + \epsilon_1 \cos(\mathbf{K}\cdot\mathbf{x}) \\
\sigma &= \sigma_0 + \sigma_1 \cos(\mathbf{K}\cdot\mathbf{x})
\end{aligned} \tag{3}
$$


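where $\epsilon_1$ and $\sigma_1$ are the amplitudes of the spatial modulation, $\epsilon_0$ is
the average dielectric constant and $\sigma_0$ the average conductivity. $\epsilon$ and
$\sigma$ are assumed to be modulated in phase. To simplify the notation
we have used the radius vector $\mathbf{x}$ and the grating vector $\mathbf{K}$:

$$
\mathbf{x} = \begin{pmatrix} x \\ y \\ z \end{pmatrix}; \qquad
\mathbf{K} = K \begin{pmatrix} \sin\phi \\ 0 \\ \cos\phi \end{pmatrix}; \qquad
K = 2\pi/\Lambda.
$$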


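Equations (2) and (3) can be combined in the form

$$
k^2 = \beta^2 - 2j\alpha\beta + 2\kappa\beta\left(e^{j\mathbf{K}\cdot\mathbf{x}} + e^{-j\mathbf{K}\cdot\mathbf{x}}\right) \tag{4}
$$

where we have introduced the average propagation constant $\beta$ and
the average absorption constant $\alpha$:

$$
\beta = \frac{2\pi(\epsilon_0)^{1/2}}{\lambda}; \qquad \alpha = \frac{\mu c \sigma_0}{2(\epsilon_0)^{1/2}}, \tag{5}
$$

and the coupling constant $\kappa$ was defined as

$$
\kappa = \frac{1}{4}\left(\frac{2\pi}{\lambda}\,\frac{\epsilon_1}{(\epsilon_0)^{1/2}} - j\mu c\,\frac{\sigma_1}{(\epsilon_0)^{1/2}}\right). \tag{6}
$$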


This coupling constant describes the coupling between the reference
wave R and the signal wave S. It is the central parameter in the coupled
wave theory. For $\kappa = 0$ there is no coupling between R and S and,
therefore, there is no diffraction.

Optical media are usually characterized by their refractive index 
and their absorption constant. We also find it convenient to use these 
parameters if the following conditions are met:

$$
2\pi n/\lambda \gg \alpha; \qquad 2\pi n/\lambda \gg \alpha_1; \qquad n \gg n_1, \tag{7}
$$


which is true in almost every practical case. Here $n$ is the average
refractive index, and $n_1$ and $\alpha_1$ are the amplitudes of the spatial
modulation of the refractive index and the absorption constant, respectively
[compare equation (3)]. $\lambda$ is the wavelength in free space. Under the
above conditions we can write with good accuracy

$$
\beta = 2\pi n/\lambda \tag{8}
$$

and for the coupling constant

$$
\kappa = \pi n_1/\lambda - j\alpha_1/2. \tag{9}
$$
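As a quick numerical illustration of equations (8) and (9), the sketch below evaluates $\beta$ and $\kappa$ for a hypothetical grating. The function name and the sample values (n = 1.5, n1 = 0.01, no absorption modulation, wavelength 0.6328 micrometers) are assumptions chosen for illustration and do not come from the text.

```python
import math

def grating_constants(n, n1, alpha1, wavelength):
    """Return (beta, kappa) from equations (8) and (9).

    n          -- average refractive index
    n1         -- amplitude of the refractive index modulation
    alpha1     -- amplitude of the absorption modulation (1/length)
    wavelength -- free-space wavelength (same length unit)
    """
    beta = 2.0 * math.pi * n / wavelength                    # equation (8)
    kappa = math.pi * n1 / wavelength - 1j * alpha1 / 2.0    # equation (9)
    return beta, kappa

# Illustrative values only (lengths in micrometers).
beta, kappa = grating_constants(n=1.5, n1=0.01, alpha1=0.0, wavelength=0.6328)
print(f"beta  = {beta:.3f} per micrometer")
print(f"kappa = {kappa.real:.5f} per micrometer (real: pure dielectric grating)")
```

A purely dielectric grating gives a real $\kappa$, while a pure absorption grating gives an imaginary one; this distinction drives the different behavior of the two grating types.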


The spatial modulation indicated by $n_1$ or $\alpha_1$ forms a grating which
couples the two waves R and S and leads to an exchange of energy
between them. We describe these waves by complex amplitudes R(z)
and S(z) which vary along z as a result of this energy interchange or
because of an energy loss from absorption. The total electric field in
the grating is the superposition of the two waves:

$$
E = R(z)\,e^{-j\boldsymbol{\rho}\cdot\mathbf{x}} + S(z)\,e^{-j\boldsymbol{\sigma}\cdot\mathbf{x}}. \tag{10}
$$


The propagation vectors $\boldsymbol{\rho}$ and $\boldsymbol{\sigma}$ contain the information about the
propagation constants and the directions of propagation of R and S.
$\boldsymbol{\rho}$ is assumed to be equal to the propagation vector of the free reference
wave in the absence of coupling. $\boldsymbol{\sigma}$ is forced by the grating and related
to $\boldsymbol{\rho}$ and the grating vector by

$$
\boldsymbol{\sigma} = \boldsymbol{\rho} - \mathbf{K} \tag{11}
$$


which has the appearance of a conservation of momentum equation. 
$\boldsymbol{\rho}$ and $\boldsymbol{\sigma}$ have been chosen to conform as closely as possible with our
picture of the physical process of the diffraction in the grating. If the 
actual phase velocities differ somewhat from the assumed values, then 
these differences will appear in the complex amplitudes R(z) and S(z) 
as a result of the theory. 

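Figure 2 shows the vectors of interest and their orientation. The
components of $\boldsymbol{\rho}$ are $\rho_x$ and $\rho_z$, which are given by

$$
\boldsymbol{\rho} = \begin{pmatrix} \rho_x \\ 0 \\ \rho_z \end{pmatrix}
= \beta \begin{pmatrix} \sin\theta \\ 0 \\ \cos\theta \end{pmatrix}. \tag{12}
$$

From this and equation (11) follow the $\boldsymbol{\sigma}$-components $\sigma_x$ and $\sigma_z$:

$$
\boldsymbol{\sigma} = \begin{pmatrix} \sigma_x \\ 0 \\ \sigma_z \end{pmatrix}
= \beta \begin{pmatrix} \sin\theta - \dfrac{K}{\beta}\sin\phi \\ 0 \\ \cos\theta - \dfrac{K}{\beta}\cos\phi \end{pmatrix}. \tag{13}
$$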


Fig. 2 - $\boldsymbol{\rho}$ and $\boldsymbol{\sigma}$, the propagation vectors of the reference wave R and the signal
wave S, and their relation to the grating vector $\mathbf{K}$. The obliquity factors $c_R$ and $c_S$
are indicated.

The vector relation (11) is shown in Fig. 3 together with a circle
of radius $\beta$. The general case is shown in Fig. 3a, where the Bragg
condition is not met and the length of $\boldsymbol{\sigma}$ differs from $\beta$. Figure 3b shows
the same diagram for incidence at the Bragg angle $\theta_0$. In this special
case the lengths of both $\boldsymbol{\rho}$ and $\boldsymbol{\sigma}$ are equal to the free propagation
constant $\beta$, and the Bragg condition

$$
\cos(\phi - \theta_0) = K/2\beta \tag{14}
$$

is obeyed.
Fig. 3 - Vector diagram (conservation of momentum) for (a) near and (b) exact
Bragg incidence.

For a fixed wavelength the Bragg condition is violated by angular
deviations $\Delta\theta$ from the Bragg angle $\theta_0$. For a fixed angle of incidence a
similar violation takes place for changes $\Delta\lambda$ from the correct
wavelength $\lambda_0$. We write

$$
\begin{aligned}
\theta &= \theta_0 + \Delta\theta, \\
\lambda &= \lambda_0 + \Delta\lambda,
\end{aligned} \tag{15}
$$

and assume in the following that the deviations $\Delta\theta$ and $\Delta\lambda$ are small.

Angular changes $\Delta\theta$ have very similar effects on the behavior of the
grating as wavelength changes $\Delta\lambda$, and there is a close relation between
the angular sensitivity and the wavelength sensitivity of thick hologram
gratings. We get an idea of this relationship by differentiating the
Bragg condition (14), from which results

$$
\frac{d\theta_0}{d\lambda_0} = \frac{K}{4\pi n \sin(\phi - \theta_0)}. \tag{16}
$$


The $\theta$-$\lambda$ connection shows up in the dephasing measure $\vartheta$, which
appears in the coupled wave equations and which is defined by

$$
\vartheta \equiv (\beta^2 - \sigma^2)/2\beta = K\cos(\phi - \theta) - \frac{K^2}{4\pi n}\,\lambda \tag{17}
$$

and which has been expressed in this form using equation (13). A Taylor
series expansion of equation (17) yields the following expression for $\vartheta$,
which is correct to the first order in the deviations $\Delta\theta$ and $\Delta\lambda$:

$$
\vartheta = \Delta\theta \cdot K \sin(\phi - \theta_0) - \Delta\lambda \cdot K^2/4\pi n. \tag{18}
$$


Note that the deviations $\Delta\theta$ and $\Delta\lambda$ which produce equal dephasing
$\vartheta$ are related by equation (16).
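The relation between the exact dephasing measure (17) and its first-order form (18) can be checked numerically. The following is a minimal sketch for a hypothetical unslanted grating; the function names and all sample values are assumptions made for illustration.

```python
import math

def dephasing_exact(theta, wavelength, K, phi, n):
    """Equation (17): K cos(phi - theta) - K^2 * wavelength / (4 pi n)."""
    return K * math.cos(phi - theta) - K**2 * wavelength / (4.0 * math.pi * n)

def dephasing_linear(dtheta, dlam, theta0, K, phi, n):
    """Equation (18): first-order expansion about the Bragg point."""
    return dtheta * K * math.sin(phi - theta0) - dlam * K**2 / (4.0 * math.pi * n)

# Hypothetical unslanted grating (phi = pi/2); lengths in micrometers.
n, lam0, phi = 1.5, 0.6328, math.pi / 2
theta0 = math.radians(30.0)
beta = 2.0 * math.pi * n / lam0
K = 2.0 * beta * math.cos(phi - theta0)      # Bragg condition (14)

dtheta = math.radians(0.5)                   # small angular deviation
exact = dephasing_exact(theta0 + dtheta, lam0, K, phi, n)
approx = dephasing_linear(dtheta, 0.0, theta0, K, phi, n)
print(f"exact (17): {exact:.5f}   first order (18): {approx:.5f}")
```

At exact Bragg incidence the dephasing vanishes, and a wavelength change chosen according to equation (16) cancels the first-order dephasing caused by a given angular change.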

We are now ready to derive the coupled wave equations. We combine
equations (1) and (4), and insert the expressions of (10) and (11).
Then we compare the terms with equal exponentials ($e^{-j\boldsymbol{\rho}\cdot\mathbf{x}}$ and $e^{-j\boldsymbol{\sigma}\cdot\mathbf{x}}$)
and arrive at

$$
R'' - 2jR'\rho_z - 2j\alpha\beta R + 2\kappa\beta S = 0 \tag{19}
$$

and

$$
S'' - 2jS'\sigma_z - 2j\alpha\beta S + (\beta^2 - \sigma^2)S + 2\kappa\beta R = 0, \tag{20}
$$


where the primes indicate differentiation with respect to z. The waves
generated in the directions of $\boldsymbol{\rho} + \mathbf{K}$ and $\boldsymbol{\sigma} - \mathbf{K}$ are neglected, together
with all other higher diffraction orders. In addition we assume that the
energy interchange between S and R is slow and that energy is absorbed
slowly, if at all. This allows us to neglect $R''$ and $S''$. We will check
the results of the theory later for a more detailed justification of this
last step. We can now introduce equation (17) and rewrite the above
equations in the form

$$
c_R R' + \alpha R = -j\kappa S \tag{21}
$$

$$
c_S S' + (\alpha + j\vartheta) S = -j\kappa R. \tag{22}
$$


These are the coupled wave equations which are the basis for our
analysis. The abbreviations $c_R$ and $c_S$ stand for the expressions

$$
\begin{aligned}
c_R &= \rho_z/\beta = \cos\theta \\
c_S &= \sigma_z/\beta = \cos\theta - \frac{K}{\beta}\cos\phi.
\end{aligned} \tag{23}
$$


Our physical picture of the diffraction process is reflected in the coupled
wave equations. A wave changes in amplitude along z because of coupling
to the other wave ($\kappa R$, $\kappa S$) or absorption ($\alpha R$, $\alpha S$). For deviations from
the Bragg condition S is forced out of synchronism with R and the
interaction decreases ($\vartheta S$).

The energy balance of the coupled-wave model is described by the 
relation 


$$
(c_R RR^* + c_S SS^*)' + 2\alpha(RR^* + SS^*) + j(\kappa - \kappa^*)(RS^* + R^*S) = 0 \tag{24}
$$

where the asterisk denotes a complex conjugate. This is easily derived
from equations (21) and (22) by multiplying them with R* and S*,
respectively, and adding the results together with the complex
conjugate results. The presence of the obliquity factors $c_R$ and $c_S$ in the
first part of equation (24) indicates that it is the power flow of the
two waves in the z direction that enters the energy balance. In the
absence of ohmic loss this power flow is conserved. The second and the
third part in the equation describe the energy loss resulting from
absorption in the grating. They correspond to the relevant terms of $\sigma EE^*$.
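The energy balance (24) can be illustrated by integrating the coupled wave equations (21) and (22) numerically. The sketch below uses a classical fourth-order Runge-Kutta scheme, which is a choice made here for illustration and not a method used in the paper; the function name and the parameter values are likewise hypothetical. For a lossless dielectric grating (real $\kappa$, $\alpha = 0$) the quantity $c_R RR^* + c_S SS^*$ should stay at its initial value $c_R$.

```python
def integrate_coupled_waves(kappa, alpha, dephasing, c_r, c_s, d, steps=2000):
    """Integrate equations (21) and (22) from z = 0 to z = d with RK4.

    Transmission geometry is assumed (c_s > 0), so both waves are
    specified at z = 0: R(0) = 1, S(0) = 0.
    """
    def derivs(r, s):
        dr = (-alpha * r - 1j * kappa * s) / c_r                      # (21)
        ds = (-(alpha + 1j * dephasing) * s - 1j * kappa * r) / c_s   # (22)
        return dr, ds

    h = d / steps
    r, s = 1.0 + 0.0j, 0.0 + 0.0j
    for _ in range(steps):
        k1r, k1s = derivs(r, s)
        k2r, k2s = derivs(r + 0.5 * h * k1r, s + 0.5 * h * k1s)
        k3r, k3s = derivs(r + 0.5 * h * k2r, s + 0.5 * h * k2s)
        k4r, k4s = derivs(r + h * k3r, s + h * k3s)
        r += h * (k1r + 2 * k2r + 2 * k3r + k4r) / 6.0
        s += h * (k1s + 2 * k2s + 2 * k3s + k4s) / 6.0
    return r, s

# Hypothetical lossless dielectric grating: real kappa, alpha = 0.
c_r, c_s = 0.9, 0.7
r, s = integrate_coupled_waves(kappa=1.0, alpha=0.0, dephasing=0.4,
                               c_r=c_r, c_s=c_s, d=1.5)
power_flow = c_r * abs(r) ** 2 + c_s * abs(s) ** 2
print(f"power flow at z = d: {power_flow:.8f} (initial value {c_r})")
```

Direct integration from z = 0 only works for the transmission geometry; for reflection gratings the signal is fixed at z = d, so a boundary-value treatment is needed instead.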


2.2 Solution of the Coupled Wave Equations 


It is straightforward to obtain the general solution of the coupled
wave equations, which is

$$
R(z) = r_1 \exp(\gamma_1 z) + r_2 \exp(\gamma_2 z) \tag{25}
$$

$$
S(z) = s_1 \exp(\gamma_1 z) + s_2 \exp(\gamma_2 z) \tag{26}
$$

where the $r_i$ and $s_i$ are constants which depend on the boundary
conditions. To determine the constants $\gamma_i$ we insert equations (25) and (26)
into the coupled wave equations and obtain

$$
(c_R \gamma_i + \alpha)\, r_i = -j\kappa s_i \tag{27}
$$

$$
(c_S \gamma_i + \alpha + j\vartheta)\, s_i = -j\kappa r_i, \qquad i = 1, 2. \tag{28}
$$


After multiplying the equations with each other we get a quadratic
equation for $\gamma_i$,

$$
(c_R \gamma_i + \alpha)(c_S \gamma_i + \alpha + j\vartheta) = -\kappa^2, \tag{29}
$$

with the solution

$$
\gamma_{1,2} = -\frac{1}{2}\left(\frac{\alpha}{c_R} + \frac{\alpha}{c_S} + j\frac{\vartheta}{c_S}\right)
\pm \frac{1}{2}\left[\left(\frac{\alpha}{c_R} - \frac{\alpha}{c_S} - j\frac{\vartheta}{c_S}\right)^2 - 4\frac{\kappa^2}{c_R c_S}\right]^{1/2}. \tag{30}
$$

At this point we divert briefly from the main derivation, because
now we have the means to check the validity of neglecting $R''$ and
$S''$ in Section 2.1. This step is justified if the conditions $R'' \ll \rho_z R'$
and $S'' \ll \sigma_z S'$ are obeyed. In view of equations (25) and (26) this
will happen if $\gamma_i \ll \beta$. According to equation (30) the above requirement
is met if $\Delta\theta \ll 1$ and if the inequalities of equation (7) are satisfied
(which is usually the case).

Continuing the coupled wave analysis, we have to determine the
constants $r_i$ and $s_i$. To do this we have to introduce boundary
conditions into our model. These are different for transmission holograms
and for reflection holograms. Figure 4 gives an indication of this. For
both hologram types the reference wave R is assumed to start with
unit amplitude at z = 0. It decays as it propagates to the right and
couples energy into S. In transmission holograms the signal S starts
out with zero amplitude at z = 0 and propagates to the right ($c_S > 0$).
In reflection holograms the signal travels to the left ($c_S < 0$) and it
starts with zero amplitude at z = d.

Let us first analyze transmission holograms where $c_S > 0$. Here,
the boundary conditions are

$$
R(0) = 1, \qquad S(0) = 0 \tag{31}
$$


as discussed before. If we insert these boundary conditions into
equations (25) and (26), it follows immediately that

$$
\begin{aligned}
r_1 + r_2 &= 1, \\
s_1 + s_2 &= 0.
\end{aligned} \tag{32}
$$

Combining these relations with equation (28) we obtain

$$
s_1 = -s_2 = \frac{-j\kappa}{c_S(\gamma_1 - \gamma_2)}. \tag{33}
$$

Fig. 4 - Wave propagation in (a) transmission and (b) reflection holograms.
The reference wave R decays while it propagates to the right. In (a) the signal S
travels to the right and gains with z, while in (b) S travels to the left and gains with
decreasing z. The shading indicates the orientation of the fringes.


Introducing these constants in equation (26) we arrive at an
expression for the amplitude of the signal wave at the output of the grating

$$
S(d) = j\,\frac{\kappa}{c_S(\gamma_1 - \gamma_2)}\left(\exp(\gamma_2 d) - \exp(\gamma_1 d)\right). \tag{34}
$$

This is a general expression, which is valid for all types of thick trans- 
mission holograms including the cases of off-Bragg incidence, of lossy 
gratings and of slanted fringe planes. 
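Equations (30) and (34) translate directly into a few lines of complex arithmetic. The sketch below is an illustrative implementation under the assumptions of this section; the function names and the sample parameters are hypothetical. For a lossless, unslanted grating at Bragg incidence with $\kappa d/\cos\theta = \pi/2$ it should return an output amplitude of unit magnitude.

```python
import cmath
import math

def gammas(kappa, alpha, dephasing, c_r, c_s):
    """Roots gamma_1, gamma_2 of the quadratic (29), written as in (30)."""
    mean = -0.5 * (alpha / c_r + alpha / c_s + 1j * dephasing / c_s)
    diff = alpha / c_r - alpha / c_s - 1j * dephasing / c_s
    root = 0.5 * cmath.sqrt(diff**2 - 4.0 * kappa**2 / (c_r * c_s))
    return mean + root, mean - root

def transmission_signal(kappa, alpha, dephasing, c_r, c_s, d):
    """Output amplitude S(d) of a transmission grating, equation (34).

    The expression is symmetric in gamma_1 and gamma_2, so the branch of
    the square root does not matter. The degenerate case gamma_1 == gamma_2
    is not handled in this sketch.
    """
    g1, g2 = gammas(kappa, alpha, dephasing, c_r, c_s)
    return 1j * kappa / (c_s * (g1 - g2)) * (cmath.exp(g2 * d) - cmath.exp(g1 * d))

# Hypothetical lossless, unslanted grating at Bragg incidence (30 degrees).
c = math.cos(math.radians(30.0))
s_out = transmission_signal(kappa=math.pi / 2 * c, alpha=0.0, dephasing=0.0,
                            c_r=c, c_s=c, d=1.0)
print(f"|S(d)|^2 = {abs(s_out) ** 2:.6f}")
```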

The analysis of reflection holograms follows a pattern similar to the 
above. We have cs < 0 and boundary conditions given by 


$$
R(0) = 1, \qquad S(d) = 0. \tag{35}
$$

The output plane for the signal wave is now at z = 0, and S(0) is
the output amplitude of interest. Inserting the boundary conditions in
equations (25) and (26) yields

$$
\begin{aligned}
r_1 + r_2 &= 1, \\
s_1 \exp(\gamma_1 d) + s_2 \exp(\gamma_2 d) &= 0.
\end{aligned} \tag{36}
$$


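To proceed with our derivation we rewrite the above relation for $s_1$
and $s_2$ in the form

$$
\begin{aligned}
s_1\left(\exp(\gamma_2 d) - \exp(\gamma_1 d)\right) &= (s_1 + s_2)\exp(\gamma_2 d) \\
s_2\left(\exp(\gamma_2 d) - \exp(\gamma_1 d)\right) &= -(s_1 + s_2)\exp(\gamma_1 d).
\end{aligned} \tag{37}
$$

Then we sum equation (28) for i = 1 and i = 2 and obtain the relation

$$
-j\kappa(r_1 + r_2) = -j\kappa = (s_1 + s_2)(\alpha + j\vartheta) + c_S(\gamma_1 s_1 + \gamma_2 s_2). \tag{38}
$$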


Using the relations (37) to substitute the sum $(s_1 + s_2)$ for the terms
$s_1$ and $s_2$ in this equation we finally arrive at the result for the amplitude
S(0) of the output signal of a reflection hologram

$$
S(0) = s_1 + s_2 = -j\kappa \Big/ \left(\alpha + j\vartheta + c_S\,\frac{\gamma_1 \exp(\gamma_2 d) - \gamma_2 \exp(\gamma_1 d)}{\exp(\gamma_2 d) - \exp(\gamma_1 d)}\right). \tag{39}
$$


This is, again, a formula of quite general validity, including off-Bragg 
incidence, loss, and slant. 
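The reflection formula (39) can be evaluated in the same way as the transmission formula. The sketch below is illustrative only, with a hypothetical function name and sample values. As a check, in the lossless, unslanted, on-Bragg limit ($c_S = -c_R = -\cos\theta_0$) equations (30) and (39) reduce to $|S(0)| = \tanh(\kappa d/\cos\theta_0)$, which the example reproduces.

```python
import cmath
import math

def reflection_signal(kappa, alpha, dephasing, c_r, c_s, d):
    """Output amplitude S(0) of a reflection grating (c_s < 0), equation (39)."""
    mean = -0.5 * (alpha / c_r + alpha / c_s + 1j * dephasing / c_s)
    diff = alpha / c_r - alpha / c_s - 1j * dephasing / c_s
    root = 0.5 * cmath.sqrt(diff**2 - 4.0 * kappa**2 / (c_r * c_s))
    g1, g2 = mean + root, mean - root                    # equation (30)
    e1, e2 = cmath.exp(g1 * d), cmath.exp(g2 * d)
    ratio = (g1 * e2 - g2 * e1) / (e2 - e1)
    return -1j * kappa / (alpha + 1j * dephasing + c_s * ratio)

# Hypothetical lossless, unslanted reflection grating with kappa*d/cos(theta0) = 1.
c = math.cos(math.radians(20.0))
s_out = reflection_signal(kappa=1.0 * c, alpha=0.0, dephasing=0.0,
                          c_r=c, c_s=-c, d=1.0)
print(f"|S(0)| = {abs(s_out):.6f}, tanh(1) = {math.tanh(1.0):.6f}")
```

For very thick or very strongly coupled gratings the exponentials can overflow; a production implementation would rescale them, which this sketch does not attempt.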

In the following sections we discuss the behavior of transmission and
reflection holograms in greater detail, using the general formulas
derived above. In these discussions a parameter of prime interest is the
diffraction efficiency $\eta$, which is defined as

$$
\eta = \frac{|c_S|}{c_R}\, SS^* \tag{40}
$$

where S is the (complex) amplitude of the output signal for a reference
wave R incident with unit amplitude. $\eta$ is the fraction of the incident
light power which is diffracted into the signal wave. S is equal to S(d)
for transmission holograms and equal to S(0) for reflection holograms
in the notation of this section. But for reasons of simplicity we omit
the arguments in the following sections. The obliquity factors $c_R$ and
$c_S$ appear in the above definition for the same reason they have
appeared in the energy balance of equation (24): in the absence of loss
it is the power flow in the z direction which is conserved.

For slanted gratings another important parameter is the slant factor c,
which is defined as the ratio between the obliquity factors

$$
c = c_R/c_S = -\cos\theta_0/\cos(\theta_0 - 2\phi)
$$

which we have expressed here, for Bragg incidence, in terms of the
angle of incidence $\theta_0$ and the slant angle $\phi$. Figure 5 indicates lines of
constant c as a function of $\theta_0$ and $\phi$. For transmission holograms c is
positive (c > 0), and for reflection holograms c is negative (c < 0).
In the diagram transmission and reflection holograms are separated by
the line for $c = \infty$.
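The sign rule for the slant factor is easy to check numerically. The following sketch uses hypothetical helper names and evaluates c for two simple cases at a Bragg angle of 30 degrees: fringes perpendicular to the surface ($\phi = \pi/2$) and fringes parallel to the surface ($\phi = 0$).

```python
import math

def slant_factor(theta0, phi):
    """c = c_R / c_S = -cos(theta0) / cos(theta0 - 2*phi) at Bragg incidence."""
    return -math.cos(theta0) / math.cos(theta0 - 2.0 * phi)

def hologram_type(theta0, phi):
    """Classify the grating by the sign of the slant factor."""
    return "transmission" if slant_factor(theta0, phi) > 0 else "reflection"

theta0 = math.radians(30.0)
print(hologram_type(theta0, math.pi / 2))   # transmission (c = 1)
print(hologram_type(theta0, 0.0))           # reflection   (c = -1)
```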


III. TRANSMISSION HOLOGRAMS 


In this section we discuss transmission holograms in greater detail. 
We give algebraic formulas and their numerical evaluations for the 
diffraction efficiencies and the angular and wavelength sensitivities of 
dielectric and of absorption gratings. This includes results on the 
influence of loss and slant. 


Fig. 5 - The slant factor c as a function of the angle of incidence $\theta_0$ and the slant
angle $\phi$. c is positive for transmission holograms and negative for reflection holograms.

It is convenient to write the various diffraction formulas in terms
of parameters $\nu$ and $\xi$, which are redefined for each grating type. In
these parameters are lumped together the constants of the medium
($n$, $\alpha$, $n_1$, $\alpha_1$, $\kappa$), the obliquity factors ($c_R$, $c_S$), the wavelength, the
grating thickness $d$, and the dephasing measure $\vartheta$. By using $\nu$ and $\xi$,
various trade-offs become immediately apparent.

We recall that, for transmission holograms, $c_S$ is positive and the
output signal appears at z = d. Combining equations (30) and (34)
we obtain a general formula for the signal amplitude S of a
transmission grating

$$
\begin{aligned}
S &= -j\left(\frac{c_R}{c_S}\right)^{1/2} \exp(-\alpha d/c_R)\; e^{\xi}\; \frac{\sin\left[(\nu^2 - \xi^2)^{1/2}\right]}{(1 - \xi^2/\nu^2)^{1/2}}, \\
\nu &= \kappa d/(c_R c_S)^{1/2}, \\
\xi &= \frac{1}{2}\, d\left(\frac{\alpha}{c_R} - \frac{\alpha}{c_S} - j\frac{\vartheta}{c_S}\right),
\end{aligned} \tag{41}
$$

where $\kappa$ is the coupling constant given in equation (9), $\vartheta$ the dephasing
measure of equation (18), $c_R$ and $c_S$ are the obliquity factors of
equation (23), $\alpha$ is the absorption constant and $d$ the grating thickness.
In the above form the parameters $\nu$ and $\xi$ are, in general, of complex
value.


3.1 Lossless Dielectric Gratings 


For completeness we give the formulas for the lossless dielectric
grating. For the unslanted case of this grating these formulas have
been previously obtained by several workers whose prime interest
was light diffraction by acoustic waves. For this grating type it
is easy to include the effect of slanted fringes.* For the lossless dielectric
grating we have a coupling constant $\kappa = \pi n_1/\lambda$ and $\alpha = \alpha_1 = 0$.
Equation (41) can be rewritten in the form

$$
\begin{aligned}
S &= -j\left(\frac{c_R}{c_S}\right)^{1/2} e^{-j\xi}\, \frac{\sin\left[(\nu^2 + \xi^2)^{1/2}\right]}{(1 + \xi^2/\nu^2)^{1/2}}, \\
\nu &= \frac{\pi n_1 d}{\lambda (c_R c_S)^{1/2}}, \\
\xi &= \vartheta d/2c_S,
\end{aligned} \tag{42}
$$

where $\nu$ and $\xi$ have been redefined and are real-valued. The associated
formula for the diffraction efficiency is

$$
\eta = \sin^2\left[(\nu^2 + \xi^2)^{1/2}\right] / (1 + \xi^2/\nu^2). \tag{43}
$$

* Slant was also included in the treatment of dielectric gratings in Ref. 29.


For significant deviations from the Bragg condition the parameters
$\nu$ and $\xi$ are of equal order of magnitude, and we can take $\nu$ as independent
of $\Delta\theta$ or $\Delta\lambda$ without causing an appreciable change in the predictions
of equation (43). In this equation the angular and wavelength deviations
are represented by the parameter $\xi$, which can be written in the form

$$
\begin{aligned}
\xi &= \Delta\theta \cdot K d \sin(\phi - \theta_0)/2c_S \\
    &= -\Delta\lambda \cdot K^2 d / 8\pi n c_S
\end{aligned} \tag{44}
$$

by using equation (18).
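Equation (43) is simple enough to explore numerically. The sketch below evaluates the efficiency as a function of $\nu$ and $\xi$ and locates the half-power point by bisection. The function names are hypothetical, and the bisection assumes that the efficiency decreases monotonically with $\xi$ over the search interval, which holds for the three $\nu$ values used here but not in general.

```python
import math

def efficiency(nu, xi):
    """Equation (43) for a lossless dielectric transmission grating."""
    arg = math.sqrt(nu**2 + xi**2)
    return math.sin(arg) ** 2 / (1.0 + (xi / nu) ** 2)

def half_power_xi(nu, hi=2.0, tol=1e-10):
    """Bisect for the xi in (0, hi) where the efficiency is half its peak.

    Assumes a monotonic decrease of the efficiency on [0, hi].
    """
    target = 0.5 * efficiency(nu, 0.0)
    lo = 0.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if efficiency(nu, mid) > target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

for nu in (math.pi / 4, math.pi / 2, 3 * math.pi / 4):
    print(f"nu = {nu:.3f}: peak = {efficiency(nu, 0.0):.3f}, "
          f"half-power xi = {half_power_xi(nu):.2f}")
```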

The angular and wavelength sensitivities of lossless dielectric gratings
are shown in Fig. 6, where the efficiencies as given by equation (43)
are plotted (normalized) as a function of $\xi$ for three values of $\nu$. The
figure shows the sensitivity of gratings with $\nu = \pi/4$ and a peak
diffraction efficiency of $\eta_0 = 0.5$, with $\nu = \pi/2$ and a peak efficiency of $\eta_0 = 1$,
and with $\nu = 3\pi/4$ and $\eta_0 = 0.5$. We notice that the half-power points
are reached for values near $\xi = 1.5$. There is some narrowing in the
sensitivity curves for increasing values of $\nu$, and a marked increase in
the side lobe intensity.


Fig. 6 - Transmission holograms: the angular and wavelength sensitivity of
lossless dielectric gratings with the normalized efficiencies $\eta/\eta_0$ as a function of $\xi$.

Fig. 7 - Transmission holograms: the angular sensitivity of a lossy dielectric
grating with $\nu = \pi/2$ and $D_0 = 2$ compared with that of a lossless dielectric grating
($D_0 = 0$), for $\theta_0 = 30^\circ$ and $\beta d = 50$. The horizontal axis is $\Delta\theta$ in degrees and the
vertical axis is $\eta/\eta_0$.


The above formulas include the influence of slant through the obliquity
factors $c_R$ and $c_S$. If there is no slant ($\phi = \pi/2$) and if the Bragg
condition is obeyed then $c_R = c_S = \cos\theta_0$ and equation (43) becomes the
well-known formula

$$
\eta = \sin^2(\pi n_1 d/\lambda\cos\theta_0). \tag{45}
$$

By inserting the above half-power values for $\xi$ into equation (44) we
obtain simple rules of thumb for the angular and spectral half-power
bandwidths of unslanted gratings: $2\Delta\theta_{1/2} \approx \Lambda/d$, $2\Delta\lambda_{1/2}/\lambda \approx \cot\theta_0 \cdot \Lambda/d$.
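As a worked example of these rules of thumb, the sketch below evaluates both bandwidths for a hypothetical grating. The function name and the numbers (1 micrometer period, 15 micrometer thickness, 30 degree Bragg angle) are assumptions chosen only to show orders of magnitude.

```python
import math

def half_power_bandwidths(grating_period, thickness, theta0):
    """Rule-of-thumb full bandwidths of an unslanted transmission grating.

    Returns (2 * dtheta_half in radians, 2 * dlambda_half / lambda).
    """
    angular = grating_period / thickness          # 2*dtheta ~ Lambda/d
    spectral = angular / math.tan(theta0)         # 2*dlambda/lambda ~ cot(theta0)*Lambda/d
    return angular, spectral

ang, spec = half_power_bandwidths(1.0, 15.0, math.radians(30.0))
print(f"angular bandwidth           ~ {math.degrees(ang):.1f} degrees")
print(f"relative spectral bandwidth ~ {spec:.3f}")
```

Both bandwidths scale with the ratio of grating period to thickness, so doubling the thickness halves the angular and spectral acceptance.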


3.2 Lossy Dielectric Gratings 


Let us first study the influence of loss on the angular sensitivity
of a dielectric grating. We assume that there is no slant ($\phi = \pi/2$)
and therefore $c_R = c_S = \cos\theta$. With this and a coupling constant of
$\kappa = \pi n_1/\lambda$ we obtain from equation (41) for the signal amplitude

$$
\begin{aligned}
S &= -j \exp(-\alpha d/\cos\theta)\; e^{-j\xi}\; \frac{\sin\left[(\nu^2 + \xi^2)^{1/2}\right]}{(1 + \xi^2/\nu^2)^{1/2}}, \\
\nu &= \pi n_1 d/\lambda\cos\theta, \\
\xi &= \vartheta d/2\cos\theta = \Delta\theta \cdot \beta d \sin\theta_0,
\end{aligned} \tag{46}
$$

where $\nu$ and $\xi$ have been redefined, and $\xi$ has been expressed in the needed
form with the use of equations (14) and (18).
Equation (46) has a form similar to that of equation (42) except
for an additional exponential term containing the absorption
constant $\alpha$. This term decreases the peak efficiency and it changes the
angular sensitivity of the grating. But this change is very small, even
for high loss values, as illustrated in Fig. 7. This figure compares a
grating of high loss ($D_0 = 2$) with a lossless grating ($D_0 = 0$) for a
parameter value of $\nu = \pi/2$, a Bragg angle of $\theta_0 = 30^\circ$, and an
optical grating thickness of $\beta d = 2\pi n d/\lambda = 50$.
The loss parameter $D_0$ was defined as

$$
D_0 = \alpha d/\cos\theta_0 \tag{47}
$$


which is closely related to the conventional photographic density D
(except that $D_0$ is measured in the direction of the reference wave
given by $\theta_0$). A value of $D_0 = 2$, which is the parameter used for the
dashed curve, represents very high loss, with a decrease of the peak
efficiency by a factor of about 50. Still, the differences of the two
sensitivity curves are very small and consist mostly of an angular shift.
The differences are even smaller for larger values of $\beta d$ (we checked
up to $\beta d = 200$), and, of course, for smaller values of $D_0$. The main
conclusion is that the presence of loss has very little influence on the
angular sensitivity of a dielectric transmission grating. This is probably
because absorption influences the phase relations between the waves
R and S very little. It agrees with observations by Belvaux.
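The quoted reduction of the peak efficiency follows directly from equation (46): at Bragg incidence ($\xi = 0$, $\theta = \theta_0$) the efficiency is $\exp(-2D_0)\sin^2\nu$. The short sketch below, with a hypothetical function name, evaluates the reduction factor for $D_0 = 2$.

```python
import math

def lossy_peak_efficiency(nu, d0):
    """On-Bragg efficiency of an unslanted lossy dielectric grating.

    From equation (46) with xi = 0: eta = exp(-2 * D0) * sin(nu)**2,
    where D0 = alpha * d / cos(theta0) as in equation (47).
    """
    return math.exp(-2.0 * d0) * math.sin(nu) ** 2

lossless = lossy_peak_efficiency(math.pi / 2, 0.0)
lossy = lossy_peak_efficiency(math.pi / 2, 2.0)
print(f"reduction factor = {lossless / lossy:.1f}")   # exp(4), roughly 50
```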

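Next let us consider the influence of loss on the efficiency of a slanted
dielectric grating. For simplicity we assume Bragg incidence, that is,
$\vartheta = 0$. The obliquity factors are positive and given by $c_R = \cos\theta_0$
and $c_S = -\cos(\theta_0 - 2\phi)$. For this case we can write equation (41)
for the signal amplitude S in the form

$$
\begin{aligned}
S &= -j\left(\frac{c_R}{c_S}\right)^{1/2} \exp\left[-\tfrac{1}{2}D_0(1 + c)\right] \frac{\sin\left[(\nu^2 - \xi^2)^{1/2}\right]}{(1 - \xi^2/\nu^2)^{1/2}}, \\
\nu &= \frac{\pi n_1 d}{\lambda(c_R c_S)^{1/2}}, \\
\xi &= \tfrac{1}{2}D_0(1 - c),
\end{aligned} \tag{48}
$$

where we have used the loss parameter $D_0$ as above in equation (47),
and the slant factor c:

$$
D_0 = \alpha d/c_R = \alpha d/\cos\theta_0, \qquad c = c_R/c_S = -\cos\theta_0/\cos(\theta_0 - 2\phi).
$$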

Figure 8 shows the diffraction efficiency of slanted gratings as
calculated from equation (48). The efficiencies are plotted as a function
of the slant factor c for various values of $D_0$, and for a value of $\nu = \pi/2$
which corresponds to the maximum attainable efficiencies. Similar
curves for $\nu = \pi/4$ and $\nu = 3\pi/4$ and the same $D_0$ values are almost
identical to the curves of Fig. 8, except that the efficiency scale is
reduced to a maximum efficiency of 0.5. This implies that for the
range of chosen parameter values the exponential factor in equation
(48) dominates in predicting the slant-dependence of the diffraction
efficiency.

Fig. 8 - Transmission holograms: the efficiency of lossy dielectric gratings as a
function of slant for $\nu = \pi/2$. $c = c_R/c_S$ is the slant factor.

The results show that, for higher efficiencies, the grating prefers
small c-values, assuming constant $\theta_0$ and $D_0$. This is a preference for
small exit angles for S, which means that we get the best efficiency
if the signal wave leaves the grating on the shortest possible path
after it has been generated.
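The preference for small c-values can be reproduced from equation (48) together with the efficiency definition (40). In the sketch below the prefactor $c_R/c_S$ of $|S|^2$ cancels against the obliquity ratio in (40); the function name and the sample values are hypothetical, and $\nu$ is held fixed while c is varied, as in Fig. 8.

```python
import cmath
import math

def slanted_lossy_efficiency(nu, d0, c):
    """Efficiency of a lossy, slanted dielectric transmission grating at Bragg incidence.

    Uses equation (48) with eta = (c_S / c_R) * |S|^2, so the prefactor
    c_R / c_S in |S|^2 cancels. c > 0 is the slant factor c_R / c_S.
    """
    xi = 0.5 * d0 * (1.0 - c)
    w = cmath.sqrt(nu**2 - xi**2)            # imaginary when |xi| > nu
    ratio = cmath.sin(w) / w if w != 0 else 1.0
    # sin(w) / (1 - xi^2/nu^2)^(1/2) equals nu * sin(w) / w
    return math.exp(-d0 * (1.0 + c)) * nu**2 * abs(ratio) ** 2

nu = math.pi / 2
for c in (0.5, 1.0, 2.0):
    print(f"c = {c:.1f}: eta = {slanted_lossy_efficiency(nu, 1.0, c):.4f}")
```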


3.3 Unslanted Absorption Gratings 


When one records holograms in conventional photographic
emulsions one produces absorption gratings (bleaching can convert this
into a dielectric grating). In an absorption grating there is no spatial
modulation of the refractive index ($n_1 = 0$) and the coupling is provided
by a modulation ($\alpha_1$) of the absorption constant. We have, then, an
imaginary coupling constant $\kappa = -j\alpha_1/2$. In this section we study the
efficiencies and the angular and wavelength sensitivities of unslanted
absorption gratings where $\phi = \pi/2$ and $c_R = c_S = \cos\theta$. From
equation (41) we obtain for the signal amplitude

$$
\begin{aligned}
S &= -\exp(-\alpha d/c_R)\; e^{-j\xi}\; \frac{\sinh\left[(\nu^2 - \xi^2)^{1/2}\right]}{(1 - \xi^2/\nu^2)^{1/2}}, \\
\nu &= \alpha_1 d/2\cos\theta, \\
\xi &= \vartheta d/2\cos\theta = \Delta\theta\cdot\beta d\cdot\sin\theta_0 = -\tfrac{1}{2}(\Delta\lambda/\lambda)\,K d\tan\theta_0,
\end{aligned} \tag{49}
$$

where $\nu$ and $\xi$ are real-valued, and equation (18) was used to express
the parameter $\xi$ again in various forms, showing explicitly the angular
deviations $\Delta\theta$ and the wavelength deviation $\Delta\lambda$ from the Bragg
condition.

For Bragg incidence we have $\vartheta = 0$, and obtain from the above a
formula for the diffraction efficiency $\eta$ of absorption gratings

$$
\eta = \exp(-2\alpha d/\cos\theta_0)\,\sinh^2(\alpha_1 d/2\cos\theta_0). \tag{50}
$$


As we exclude the presence of negative absorption (gain) in the medium,
there is an upper limit for the amplitude $\alpha_1$ of the assumed sinusoidal
modulation, which is $\alpha_1 \le \alpha$. The highest diffraction efficiency possible
for an absorption grating is reached in the limiting case $\alpha_1 = \alpha$ for a
value of $\alpha d/\cos\theta_0 = \ln 3$. According to equation (50) this maximum
efficiency has a value of $\eta_{\max} = 1/27$, or 3.7 percent.
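The maximum of equation (50) can be confirmed with a simple scan. The sketch below writes the formula in terms of the dimensionless quantities $\alpha d/\cos\theta_0$ and $\alpha_1 d/\cos\theta_0$; the function name, argument names and the grid resolution are choices made here for illustration.

```python
import math

def absorption_efficiency(d0, d1):
    """Equation (50) with d0 = alpha*d/cos(theta0) and d1 = alpha1*d/cos(theta0)."""
    return math.exp(-2.0 * d0) * math.sinh(0.5 * d1) ** 2

# Full modulation depth (alpha1 = alpha): scan the loss parameter on a grid.
best_d, best_eta = max(
    ((k / 1000.0, absorption_efficiency(k / 1000.0, k / 1000.0))
     for k in range(1, 5001)),
    key=lambda pair: pair[1],
)
print(f"best loss parameter = {best_d:.3f} (ln 3 = {math.log(3.0):.3f})")
print(f"maximum efficiency  = {best_eta:.4f} (1/27 = {1.0 / 27.0:.4f})")
```

Setting the derivative of $e^{-2x}\sinh^2(x/2)$ to zero gives $e^x = 3$, which is where the value $\ln 3$ comes from.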

Figure 9 shows values for the diffracted amplitude S of absorption
gratings as computed from equation (50) as a function of the
modulation amplitude $\alpha_1$ and for various values of the depth of modulation.
For convenience we have again used loss parameters, which are
$D_0 = \alpha d/\cos\theta_0$ and $D_1 = \alpha_1 d/\cos\theta_0$. $D_1$ is a measure for the amplitude
of the spatial modulation and $D_1/D_0 = \alpha_1/\alpha$ indicates the modulation
depth. The dashed curves for constant $D_0$ show the grating behavior
for constant background absorption. We have plotted S on a linear
scale in order to identify the regions of linear grating response. Note
that a good linear response and relatively good efficiency are obtained
if the absorption background is held constant at a value of about $D_0 = 1$.

Fig. 9. Transmission holograms: the diffracted amplitude of an absorption grating as a function of the modulation $D_1 = \alpha_1 d/\cos\theta_0 = 2\nu$ for various modulation depths $D_1/D_0$ (solid curves) and various bias levels $D_0 = \alpha d/\cos\theta_0$ (dashed curves).

Equation (49) predicts also the angular sensitivity and the frequency sensitivity of absorption gratings. Such sensitivity curves are plotted in Fig. 10 for the special case of $\alpha_1 = \alpha$, and values of $\nu = D_1/2 = 1$ (dashed) and $\nu = \tfrac{1}{2}\ln 3 = 0.55$. For the latter parameter value the peak efficiency of 3.7 percent is reached, and for $\nu = 1$ we have a peak efficiency of 2.5 percent. In the figure the relative efficiencies are plotted as functions of the parameter $\xi$. We note that there is very little difference between the sensitivity curves for the two $\nu$-values chosen. We have also computed the sensitivity for smaller values of $\nu$ (0.2, 0.4), but the resulting curves differ so little from the ones shown that we have omitted them from the figure. The sensitivity curves are very similar to those of the dielectric gratings with smaller $\nu$-values which are shown in Fig. 6. Again, the half-power points are reached for about $\xi = 1.5$. But for absorption gratings there is no narrowing with increasing values of $\nu$, and the side lobe intensity remains low.
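The sensitivity curves described above can be reproduced from equation (49). The sketch below is an illustration under stated assumptions, not the original computation: it forms the ratio $\eta(\xi)/\eta(0)$, in which the exponential loss factor cancels, and uses complex arithmetic so that the hyperbolic sine turns into an ordinary sine once $\xi$ exceeds $\nu$. The bisection helper and its search interval are hypothetical choices.

```python
import cmath
import math


def relative_efficiency(xi: float, nu: float) -> float:
    """eta(xi)/eta(0) from equation (49); the loss factor cancels in the ratio."""
    root = cmath.sqrt(complex(nu * nu - xi * xi))
    # sh(root)/(1 - xi^2/nu^2)^(1/2) equals nu*sh(root)/root, with limit nu at root = 0.
    amplitude = nu if root == 0 else nu * cmath.sinh(root) / root
    return abs(amplitude) ** 2 / math.sinh(nu) ** 2


def half_power_point(nu: float, lo: float = 0.0, hi: float = 3.0) -> float:
    """Bisection for the first xi at which the relative efficiency falls to 1/2."""
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if relative_efficiency(mid, nu) > 0.5:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


for nu in (0.5 * math.log(3.0), 1.0):
    print(f"nu = {nu:.2f}: half-power point at xi ~ {half_power_point(nu):.2f}")
```

For both $\nu$-values the half-power point lands close to the value of about 1.5 read from the curves, and the two results differ only slightly, in line with the remark that the curves barely change with $\nu$.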


3.4 Slanted Absorption Gratings 


In this section we consider the influence of slant on the efficiency of an absorption grating. For simplicity we assume Bragg incidence ($\vartheta = 0$), and describe the slant by the obliquity factors $c_R = \cos\theta_0$ and $c_S = -\cos(\theta_0 - 2\phi)$, as before. Using equation (41) we obtain, for this case, the following expression for the signal amplitude $S$

$$
\begin{aligned}
S &= -\left(\frac{c_R}{c_S}\right)^{1/2} \exp\left[-\tfrac{1}{2}\alpha d\left(\frac{1}{c_R} + \frac{1}{c_S}\right)\right] \frac{\operatorname{sh}\,(\nu^2 + \xi^2)^{1/2}}{(1 + \xi^2/\nu^2)^{1/2}} \\
\nu &= \alpha_1 d/2(c_R c_S)^{1/2} \\
\xi &= \tfrac{1}{2}\alpha d\left(\frac{1}{c_R} - \frac{1}{c_S}\right),
\end{aligned}
\tag{51}
$$

where $\nu$ and $\xi$ are redefined as real parameters. We have plotted the slant-dependence of absorption gratings in Fig. 11 for the special case of $\alpha_1 = \alpha$, that is, maximum depth of modulation. The diffraction efficiency $\eta$ is shown as a function of the slant factor $c$ for various values of the loss parameter $D_0$. These quantities are defined, as before, by

$$
\begin{aligned}
D_0 &= \alpha d/c_R = \alpha d/\cos\theta_0 \\
c &= c_R/c_S.
\end{aligned}
\tag{52}
$$


Fig. 10. Transmission holograms: the angular and wavelength sensitivity of an absorption grating for $\alpha_1 = \alpha$ ($D_1 = D_0$) and values of $\nu = D_1/2 = 0.55$ ($\eta_0 = 0.037$) and $\nu = D_1/2 = 1$ ($\eta_0 = 0.025$).

Fig. 11. Transmission holograms: the efficiency of an absorption grating as a function of slant for $\alpha_1 = \alpha$ ($D_1 = D_0$). $c = c_R/c_S$ is the slant factor.

The efficiency is seen to reach its absolute maximum of 3.7 percent for the unslanted grating ($c = 1$) and for a loss parameter of $D_0 = \ln 3$. For larger values of $D_0$ the efficiencies reach relative maxima for exit angles of the signal wave which are smaller than that of the reference wave ($c < 1$), while for smaller $D_0$-values the situation is reversed.
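The slant dependence described above can be explored with a few lines of code. This is a sketch under two assumptions that are not restated in this section: the efficiency is taken as $\eta = (|c_S|/c_R)\,SS^*$, the general definition used in the paper, and the modulation is at its maximum depth ($\alpha_1 = \alpha$). Substituting $D_0$ and $c$ from equation (52) into equation (51) then gives $\eta = c\,\exp[-D_0(1 + c)]\,\operatorname{sh}^2[\tfrac{1}{2}D_0(1 - c + c^2)^{1/2}]/(1 - c + c^2)$. The function name is hypothetical.

```python
import math


def slanted_absorption_efficiency(c: float, D0: float) -> float:
    """Efficiency of a slanted absorption transmission grating with alpha1 = alpha.

    Derived from equations (51) and (52), assuming eta = (c_S/c_R)*|S|^2.
    c = c_R/c_S is the (positive) slant factor, D0 = alpha*d/c_R.
    """
    q = math.sqrt(1.0 - c + c * c)
    return c * math.exp(-D0 * (1.0 + c)) * math.sinh(0.5 * D0 * q) ** 2 / (q * q)


for D0 in (0.5, math.log(3.0), 2.0):
    row = ", ".join(f"c={c}: {slanted_absorption_efficiency(c, D0):.4f}"
                    for c in (0.5, 1.0, 2.0))
    print(f"D0 = {D0:.2f} -> {row}")
```

With these assumptions the unslanted value at $D_0 = \ln 3$ reproduces 1/27, a large loss parameter favors $c < 1$, and a small one favors $c > 1$.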


3.5 Mixed Gratings 


Mixed gratings are those in which both the refractive index and the absorption constant are spatially modulated. This may occur in some recording materials (for example, as a result of incomplete bleaching, or in cases where strong absorption peaks are developed which cause refractive index changes according to the Kramers-Kronig relations).* Mixed gratings are described by a complex coupling constant, which is given in equation (9). For the special case of unslanted gratings ($\phi = \pi/2$) and Bragg incidence ($\vartheta = 0$) equation (41) simplifies to

$$S = -j\exp(-\alpha d/\cos\theta_0)\,\sin(\kappa d/\cos\theta_0) \tag{53}$$

where $\kappa$ is complex. From this we obtain, after some algebra, an expression for the efficiency of mixed gratings

$$\eta = SS^* = \left[\sin^2(\pi n_1 d/\lambda\cos\theta_0) + \operatorname{sh}^2(\alpha_1 d/2\cos\theta_0)\right]\exp(-2\alpha d/\cos\theta_0), \tag{54}$$


where $n_1$ and $\alpha_1$ are the amplitudes of the modulation of the refractive index and the absorption constant, and $\alpha$ is the average absorption constant. We note that, at least for the special case considered here, there is a simple addition of the intensities diffracted by the dielectric grating and the absorption grating respectively [compare equations (46) and (50)]. The exponential factor including $\alpha$ ensures that the formula does not predict efficiencies larger than 1.

* Such effects have recently been observed by Nassenstein (see Ref. 32).
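The step from equation (53) to equation (54) rests on the identity $|\sin(a - jb)|^2 = \sin^2 a + \operatorname{sh}^2 b$. A small numerical sketch can confirm it; the coupling constant is taken as $\kappa = \pi n_1/\lambda - j\alpha_1/2$, the form quoted in Section 4.5, and all material values below are hypothetical numbers chosen only for illustration.

```python
import cmath
import math


def mixed_grating_efficiency(n1, alpha1, alpha, d, wavelength, theta0):
    """Return (|S|^2 from eq. (53), closed form of eq. (54)) for a mixed grating."""
    kappa = math.pi * n1 / wavelength - 0.5j * alpha1  # complex coupling constant
    path = d / math.cos(theta0)                        # d/cos(theta0)
    S = -1j * math.exp(-alpha * path) * cmath.sin(kappa * path)
    closed = (math.sin(math.pi * n1 * path / wavelength) ** 2
              + math.sinh(0.5 * alpha1 * path) ** 2) * math.exp(-2.0 * alpha * path)
    return abs(S) ** 2, closed


# Hypothetical example: lengths in micrometers, absorption constants in 1/um.
from_S, from_formula = mixed_grating_efficiency(
    n1=0.01, alpha1=0.02, alpha=0.02, d=15.0, wavelength=0.633,
    theta0=math.radians(20.0))
print(from_S, from_formula)
```

The two numbers agree to rounding error, and the result stays below 1 as long as $\alpha_1 \le \alpha$.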


IV. REFLECTION HOLOGRAMS 


In reflection holograms the recorded fringe-planes are of an orienta- 
tion which is more or less parallel to the surfaces of the recording 
medium, and the signal appears as a “‘reflection’’ of the reference wave. 
We have illustrated this situation in Fig. 4b. It is expressed in the 
coupled wave analysis by negative values of the obliquity factor
$c_S$ ($c_S < 0$). In addition, the signal amplitude $S$ of interest is obtained
by evaluating the signal wave in the plane $z = 0$, which is also the
entrance plane for the reference wave $R$. For reflection holograms a
slant angle $\phi = 0$ describes the case of unslanted gratings. Apart from
these differences the following discussion of the detailed behavior of 
reflection holograms proceeds in a pattern similar to that of Section III, 
where we have discussed transmission holograms. 

From equations (30) and (39) we obtain a general formula for the
signal amplitude of reflection holograms which can be written in the
form


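$$
\begin{aligned}
S &= \left(\frac{c_R}{c_S}\right)^{1/2} \operatorname{sh}(\nu\operatorname{ch} a)\big/\operatorname{ch}(a + \nu\operatorname{ch} a) \\
\nu &= j\kappa d/(c_R c_S)^{1/2} \\
\xi &= \tfrac{1}{2}d\left(\frac{\alpha}{c_R} - \frac{\alpha}{c_S} - j\frac{\vartheta}{c_S}\right) \\
\operatorname{sh} a &= \xi/\nu
\end{aligned}
\tag{55}
$$

where we have again defined (complex) parameters $\nu$, $\xi$ and $a$, which lump together the constants of the medium ($n$, $\alpha$, $n_1$, $\alpha_1$, $\kappa$), the obliquity factors $c_R$ and $c_S$, the wavelength, the grating thickness $d$ and the dephasing measure $\vartheta$.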


4.1 Lossless Dielectric Gratings 


The lossless dielectric grating is characterized by a real-valued coupling constant $\kappa = \pi n_1/\lambda$, and by zero absorption $\alpha = \alpha_1 = 0$. As in the transmission-hologram counterpart, it is easy to include the case of slant in the analysis. For the present case we can rewrite equation (55) in the form

$$
\begin{aligned}
S &= \left(\frac{c_R}{c_S}\right)^{1/2}\Big/\left\{j\xi/\nu + (1 - \xi^2/\nu^2)^{1/2}\coth(\nu^2 - \xi^2)^{1/2}\right\} \\
\nu &= j\pi n_1 d/\lambda(c_R c_S)^{1/2} \\
\xi &= -\vartheta d/2c_S
\end{aligned}
\tag{56}
$$

where $\nu$ and $\xi$ have been redefined as real-valued parameters ($c_S$ is negative).

The associated formula for the diffraction efficiency of lossless dielectric gratings is

$$\eta = 1\Big/\left\{1 + (1 - \xi^2/\nu^2)/\operatorname{sh}^2(\nu^2 - \xi^2)^{1/2}\right\}, \tag{57}$$

which also provides a description of the angular and wavelength sensitivities of the grating. For unslanted acoustic gratings this formula has been previously given by Quate and his associates. Sensitivity curves calculated from equation (57) are shown in Fig. 12, where the normalized efficiencies are plotted as a function of $\xi$ for various values of $\nu = \text{const}$. The figure shows the sensitivity of a grating with $\nu = \pi/4$ and a peak efficiency of 43 percent, a grating with $\nu = \pi/2$ and $\eta_0 = 0.84$, and the corresponding values for $\nu = 3\pi/4$ and $\eta_0 = 0.96$. For $\nu = \pi/4$ the half-power points of the grating response are reached for values of approximately $\xi = 1.7$. But there is considerable broadening of the sensitivity curves for increasing values of $\nu$, and an increase in the side-lobe level.

Fig. 12. Reflection holograms: the angular and wavelength sensitivity of a lossless dielectric grating with the normalized efficiency $\eta/\eta_0$ as a function of $\xi$.

As in equation (44) for transmission holograms, we can express the parameter $\xi$ directly in the angular deviation $\Delta\theta$ or the wavelength deviation $\Delta\lambda$ by using equation (18) to obtain

$$\xi = \Delta\theta\cdot K d\cdot\sin(\theta_0 - \phi)/2c_S = \Delta\lambda\cdot K^2 d/8\pi n c_S. \tag{58}$$

These expressions can again be used to formulate rules for the angular bandwidth and the spectral bandwidth of the grating.

For an unslanted grating ($\phi = 0$) and Bragg incidence we have $c_R = -c_S = \cos\theta_0$, and equation (57) simplifies to

$$\eta = \operatorname{th}^2(\pi n_1 d/\lambda\cos\theta_0). \tag{59}$$

This is a formula which has been obtained previously for light diffraction by acoustic waves.
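Equations (57) and (59) are easy to evaluate numerically. The sketch below is an illustration rather than the original computation: it implements equation (57) with complex arithmetic so that it remains valid for $\xi > \nu$, checks that the value at $\xi = 0$ equals $\operatorname{th}^2\nu$ as in equation (59), and evaluates the normalized response at $\xi = 1.7$ for $\nu = \pi/4$. The function name is hypothetical.

```python
import cmath
import math


def reflection_efficiency(xi: float, nu: float) -> float:
    """Equation (57) for a lossless dielectric reflection grating."""
    root = cmath.sqrt(complex(nu * nu - xi * xi))
    if root == 0:
        return 1.0 / (1.0 + 1.0 / (nu * nu))  # limit sh(root)/root -> 1
    sh = cmath.sinh(root)
    if sh == 0:
        return 0.0  # exact null between side lobes
    # (1 - xi^2/nu^2)/sh^2(root) written as (root/(nu*sh(root)))^2, which is real.
    term = (root / (nu * sh)) ** 2
    return 1.0 / (1.0 + term.real)


for nu in (math.pi / 4, math.pi / 2, 3 * math.pi / 4):
    peak = reflection_efficiency(0.0, nu)
    print(f"nu = {nu:.3f}: peak efficiency {peak:.2f}, th^2(nu) = {math.tanh(nu) ** 2:.2f}")

ratio = reflection_efficiency(1.7, math.pi / 4) / reflection_efficiency(0.0, math.pi / 4)
print(f"normalized efficiency at xi = 1.7 for nu = pi/4: {ratio:.2f}")
```

The peak values come out as 0.43, 0.84 and 0.96, and the normalized efficiency at $\xi = 1.7$ is close to one half, consistent with the figures quoted in the text.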


4.2 Lossy Dielectric Gratings 


Let us first discuss the influence of loss on the angular and wavelength sensitivity of unslanted dielectric gratings. Here we have $\phi = 0$ and, to a good approximation (the two forms of $c_S$ refer to angular and to wavelength deviations, respectively),

$$
\begin{aligned}
c_R &= \cos\theta_0\,(1 - \Delta\theta\tan\theta_0) \approx \cos\theta_0 \\
c_S &= -\cos\theta_0\,(1 + \Delta\theta\tan\theta_0) \quad\text{or}\quad -\cos\theta_0\,(1 + 2\Delta\lambda/\lambda) \;\approx\; -\cos\theta_0,
\end{aligned}
\tag{60}
$$

at least as long as $\tan\theta_0 \le 1$. One can show that the formula for the signal amplitude $S$, which we have given in equation (56), is still applicable for the present case of an unslanted lossy grating if we modify the parameters $\nu$ and $\xi$ to

$$
\begin{aligned}
\nu &= \pi n_1 d/\lambda\cos\theta_0 \\
\xi &= \xi_0 - jD_0 \\
\xi_0 &= -\Delta\theta\cdot\beta d\sin\theta_0 \\
D_0 &= \alpha d/\cos\theta_0
\end{aligned}
\tag{61}
$$

where $\xi$ is now a complex parameter with $\xi_0$ describing the angular deviations and $D_0$ representing the loss. An evaluation of this formula is shown in Fig. 13, which shows the angular sensitivity of dielectric gratings for various values of the loss parameter $D_0$ and a grating parameter of $\nu = \pi/2$. In contrast to what we have observed in the case of dielectric transmission holograms (Fig. 7), we see here a quite noticeable effect of the grating loss on the sensitivity curves. With increasing loss values the curves broaden in the wings, sharpen somewhat in the center, and the side-lobe level decreases.

To study the influence of loss on the diffraction efficiencies of dielectric gratings we rewrite equation (55) in the form

$$
\begin{aligned}
S &= \left(\frac{c_R}{c_S}\right)^{1/2}\Big/\left\{\xi/\nu + (1 + \xi^2/\nu^2)^{1/2}\coth(\nu^2 + \xi^2)^{1/2}\right\} \\
\nu &= j\pi n_1 d/\lambda(c_R c_S)^{1/2} \\
\xi &= \tfrac{1}{2}D_0(1 - c)
\end{aligned}
\tag{62}
$$

where we have written $\nu$ and $\xi$ as real-valued parameters in a form which is valid for Bragg incidence and for slanted or unslanted gratings. Just as in the case of transmission holograms we have used the loss parameter $D_0$ and the slant factor $c$ (which is now negative)

$$
\begin{aligned}
D_0 &= \alpha d/\cos\theta_0 \\
c &= c_R/c_S.
\end{aligned}
\tag{63}
$$

In the case of unslanted gratings the parameters $\nu$ and $\xi$ simplify to

$$
\begin{aligned}
\nu &= \pi n_1 d/\lambda\cos\theta_0 \\
\xi &= D_0 = \alpha d/\cos\theta_0.
\end{aligned}
\tag{64}
$$

Fig. 13. Reflection holograms: the influence of loss on the angular and wavelength sensitivity of a dielectric grating for $\nu = \pi/2$. The normalized efficiencies $\eta/\eta_0$ are shown. The peak efficiencies are $\eta_0 = 0.84$ for $D_0 = 0$, $\eta_0 = 0.64$ for $D_0 = 0.5$, $\eta_0 = 0.28$ for $D_0 = 1$, and $\eta_0 = 0.12$ for $D_0 = 2$.

The results of a numerical evaluation for unslanted gratings are shown in Fig. 14, where the signal amplitude is plotted as a function of $\nu$ for various values of the loss parameter $D_0$. The curve $D_0 = 0$ gives the values for lossless gratings, while the others indicate the influence of loss.

Fig. 14. Reflection holograms: the influence of loss on the diffracted amplitude $S$ of an unslanted dielectric grating. $|S|$ is shown as a function of $\nu/\pi = n_1 d/\lambda\cos\theta_0$ for various loss parameters $D_0$.

Fig. 15. Reflection holograms: the efficiency of a lossy dielectric grating as a function of slant for $\nu = \pi/2$. $c = c_R/c_S$ is the slant factor.

The behavior of slanted dielectric gratings in the presence of loss is shown in Fig. 15. The curves of this figure are also computed from equation (62) and show the diffraction efficiency as a function of the slant factor for $\nu = \pi/2$ and various values of the loss parameter $D_0$. For constant $D_0$ we notice an increase of the efficiency for decreasing values of the slant factor, as in the case of transmission holograms. Again, for given loss and a given angle of incidence short signal paths through the grating (that is, small exit angles) are preferred for higher efficiencies.
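For the unslanted case the efficiency follows directly from equation (62) with the parameters of equation (64), since $|c_R/c_S| = 1$ and therefore $\eta = |S|^2$. The sketch below is a minimal illustration with a hypothetical function name; it evaluates the Bragg-incidence efficiency for $\nu = \pi/2$ at a few values of the loss parameter.

```python
import math


def lossy_reflection_efficiency(nu: float, D0: float) -> float:
    """|S|^2 from equation (62) for an unslanted grating, where xi = D0 (eq. 64)."""
    xi = D0
    root = math.hypot(nu, xi)  # (nu^2 + xi^2)^(1/2)
    denom = xi / nu + math.sqrt(1.0 + (xi / nu) ** 2) / math.tanh(root)
    return 1.0 / denom ** 2


for D0 in (0.0, 1.0, 2.0):
    eta = lossy_reflection_efficiency(math.pi / 2, D0)
    print(f"D0 = {D0:.1f}: eta = {eta:.2f}")
```

For $D_0 = 0$ the expression reduces to $\operatorname{th}^2\nu$, and the values for $D_0 = 0$, 1 and 2 round to 0.84, 0.28 and 0.12.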


4.3 Unslanted Absorption Gratings 


Following the pattern set in the discussion of transmission holograms (Section III), we again describe an absorption grating by an imaginary coupling constant $\kappa = -j\alpha_1/2$, and proceed to study the diffraction efficiencies and the angular and wavelength sensitivities of unslanted ($\phi = 0$) gratings. In this case equation (55) simplifies to

$$
\begin{aligned}
S &= -\left(\frac{c_R}{c_S}\right)^{1/2}\Big/\left\{\xi/\nu + [(\xi/\nu)^2 - 1]^{1/2}\coth(\xi^2 - \nu^2)^{1/2}\right\} \\
\nu &= j\alpha_1 d/2(c_R c_S)^{1/2} \\
\xi &= D_0 - j\xi_0
\end{aligned}
\tag{65}
$$

where the real-valued parameters $D_0$ and $\xi_0$ can be expressed to first order in the angular deviations $\Delta\theta$ and the wavelength deviations $\Delta\lambda$ by

$$
\begin{aligned}
D_0 &= \alpha d/\cos\theta_0 \\
\xi_0 &= \Delta\theta\cdot\beta d\sin\theta_0 = \tfrac{1}{2}(\Delta\lambda/\lambda)K d.
\end{aligned}
\tag{66}
$$

$D_0$ is a loss parameter as before, and $\xi_0$ is a normalized measure for the angular or the wavelength deviations from the Bragg condition.

If the Bragg condition is obeyed equation (65) can be written in the form

$$S = -\frac{D_1}{2}\Big/\left[D_0 + (D_0^2 - D_1^2/4)^{1/2}\coth(D_0^2 - D_1^2/4)^{1/2}\right] \tag{67}$$

where

$$D_1 = 2\nu = \alpha_1 d/\cos\theta_0$$

measures the spatial modulation of the absorption constant ($\alpha_1$).

For the deepest allowable modulation, where we have $D_1 = D_0$ ($\alpha_1 = \alpha$), this equation predicts the maximum diffraction efficiency $\eta_{\max}$ which is possible for reflection holograms with a (sinusoidal) absorption modulation. We obtain $\eta_{\max} = 1/(2 + \sqrt{3})^2$, or a maximum efficiency of 7.2 percent for $D_0 = D_1 \to \infty$. The formula reflects the experimental fact that, for reflection holograms of the absorptive kind, one obtains the largest efficiencies for high photographic densities. Figure 16 shows a numerical evaluation of the above formula. Here the signal amplitude $S$ is plotted as a function of the modulation amplitude $D_1$ for various levels of loss "bias" $D_0$ (dashed curves) and for various modulation depths $D_1/D_0$.

An evaluation of the grating sensitivity as predicted by equation (65) is shown in Fig. 17 for the special case of a maximum depth of modulation where $D_1 = D_0$. In this figure the (normalized) efficiency is plotted as a function of the parameter $\xi_0$ for various values of $D_1 = D_0$. As in the corresponding grating for the case of transmission holograms (Fig. 10) the sensitivity curves are seen to reach their half-power points for values of about $\xi_0 = 1.5$. But in the present case of reflection holograms there is a noticeable broadening of the curves with increasing loss values $D_1 = D_0$.
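The Bragg-incidence result of equation (67) lends itself to a short numerical check. The sketch below is an illustration with a hypothetical function name; it evaluates $\eta = S^2$ for the deepest modulation $D_1 = D_0$ at three sample values (0.2 is a value chosen here for illustration) and compares the large-loss limit with $1/(2 + \sqrt{3})^2$.

```python
import math


def absorption_reflection_amplitude(D0: float, D1: float) -> float:
    """Equation (67): signal amplitude at Bragg incidence, requires 0 < D1 <= D0."""
    root = math.sqrt(D0 * D0 - D1 * D1 / 4.0)
    return -(D1 / 2.0) / (D0 + root / math.tanh(root))


for D in (0.2, 1.0, 2.0):
    eta = absorption_reflection_amplitude(D, D) ** 2
    print(f"D0 = D1 = {D}: eta = {eta:.3f}")

eta_limit = absorption_reflection_amplitude(50.0, 50.0) ** 2
print(f"large-loss limit: {eta_limit:.4f}, 1/(2 + sqrt 3)^2 = {1.0 / (2.0 + math.sqrt(3.0)) ** 2:.4f}")
```

The three sample values give efficiencies of roughly 0.007, 0.05 and 0.068, and the limit approaches 0.072, in line with the peak efficiencies quoted for the sensitivity curves.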


4.4 Slanted Absorption Gratings 


Fig. 16. Reflection holograms: the diffracted amplitude of an absorption grating as a function of the modulation $D_1 = \alpha_1 d/\cos\theta_0 = 2\nu$ for modulation depths $D_1/D_0$ (solid curves) and bias levels $D_0 = \alpha d/\cos\theta_0$ (dashed curves).

In this section we consider the influence of slant on the diffraction efficiency of an absorption grating for reflection holograms. We assume Bragg incidence ($\vartheta = 0$) and again use the obliquity factors $c_R = \cos\theta_0$ and $c_S = -\cos(\theta_0 - 2\phi)$ to describe the slant (for reflection holograms we have $c_S < 0$). We find that equation (65) can be used as a formula for the signal amplitude for the present case if we modify the parameters to

$$
\begin{aligned}
\nu &= j\alpha_1 d/2(c_R c_S)^{1/2} = \tfrac{1}{2}D_1(-c)^{1/2} \\
\xi &= \tfrac{1}{2}D_0(1 - c) \\
D_0 &= \alpha d/\cos\theta_0, \qquad D_1 = \alpha_1 d/\cos\theta_0 \\
c &= c_R/c_S
\end{aligned}
\tag{68}
$$


where the slant factor $c$ is negative. All these parameters are real-valued in the present case. For a maximum depth of modulation, that is, $\alpha_1 = \alpha$, there are further simplifications, and we obtain a simple expression for the slant-dependence of the diffraction efficiency

$$\eta = -c\Big/\left\{1 - c + (1 - c + c^2)^{1/2}\coth\left[\tfrac{1}{2}D_0(1 - c + c^2)^{1/2}\right]\right\}^2. \tag{69}$$

Figure 18 shows a numerical evaluation of this formula for various values of $D_0 = D_1$. The slant factor value of $|c| = 1$ refers to unslanted gratings. In this case the maximum efficiency value $\eta_{\max} = 0.072$ is approached for large $D_0$. We note that for values of $D_0$ below unity the efficiencies increase for $|c|$-values larger than 1 and up to about 3, that is, for relatively large exit angles of the signal wave.
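Equation (69) can be evaluated directly to see both trends mentioned above. The following sketch is illustrative only, with a hypothetical function name and sample values; it checks the unslanted large-loss limit and compares $|c| = 1$ with $|c| = 3$ at a loss parameter below unity.

```python
import math


def slanted_reflection_efficiency(c: float, D0: float) -> float:
    """Equation (69): absorption reflection grating with alpha1 = alpha, c < 0."""
    if c >= 0.0:
        raise ValueError("the slant factor c = c_R/c_S is negative for reflection holograms")
    q = math.sqrt(1.0 - c + c * c)
    return -c / (1.0 - c + q / math.tanh(0.5 * D0 * q)) ** 2


print(f"c = -1, D0 = 50: {slanted_reflection_efficiency(-1.0, 50.0):.4f}")
for c in (-1.0, -2.0, -3.0):
    print(f"D0 = 0.5, c = {c}: {slanted_reflection_efficiency(c, 0.5):.4f}")
```

For $c = -1$ and large $D_0$ the value tends to 0.072, while at $D_0 = 0.5$ the efficiency grows as $|c|$ increases from 1 to 3.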


4.5 Mixed Gratings 


Mixed gratings are described by a complex coupling constant $\kappa = \pi n_1/\lambda - j\alpha_1/2$ (see Section 3.5). For Bragg incidence ($\vartheta = 0$) and unslanted fringe-planes ($\phi = 0$) we can obtain from equation (55) a formula for the signal amplitude of mixed gratings, which is

$$S = -j\kappa\Big/\left\{\alpha + (\kappa^2 + \alpha^2)^{1/2}\coth\left[\frac{d}{\cos\theta_0}(\kappa^2 + \alpha^2)^{1/2}\right]\right\} \tag{70}$$

where $\kappa$ is of complex value, $\alpha$ is the average absorption constant, $d$ the grating thickness and $\theta_0$ the angle of incidence.
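A useful consistency check on equation (70) is that it contains the two pure cases as limits: the lossless dielectric grating of equation (59) and the absorption grating of equation (67). The sketch below verifies both numerically; the function name and all numerical values are hypothetical and serve only as an illustration.

```python
import cmath
import math


def mixed_reflection_amplitude(kappa: complex, alpha: float, path: float) -> complex:
    """Equation (70); path = d/cos(theta0), kappa = pi*n1/lambda - j*alpha1/2."""
    root = cmath.sqrt(kappa * kappa + alpha * alpha)
    return -1j * kappa / (alpha + root / cmath.tanh(path * root))


path = 15.0 / math.cos(math.radians(20.0))  # hypothetical d = 15 um, theta0 = 20 deg

# Limit 1: lossless dielectric grating (alpha = alpha1 = 0), compare with eq. (59).
kappa_n = 0.05  # hypothetical pi*n1/lambda in 1/um
eta_dielectric = abs(mixed_reflection_amplitude(kappa_n, 0.0, path)) ** 2

# Limit 2: pure absorption grating (n1 = 0), compare with eq. (67).
alpha, alpha1 = 0.04, 0.03  # hypothetical values in 1/um, with alpha1 <= alpha
S_abs = mixed_reflection_amplitude(-0.5j * alpha1, alpha, path)
D0, D1 = alpha * path, alpha1 * path
w = math.sqrt(D0 * D0 - D1 * D1 / 4.0)
S_67 = -(D1 / 2.0) / (D0 + w / math.tanh(w))

print(eta_dielectric, math.tanh(kappa_n * path) ** 2)
print(S_abs, S_67)
```

Both comparisons agree to rounding error. The sketch does not guard against the degenerate case $\kappa^2 + \alpha^2 = 0$, which cannot occur as long as $\alpha_1 \le \alpha$.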


V. AMPLITUDES OF THE DIRECT WAVES 


For diagnostic purposes it is often of interest to monitor the change in amplitude of the direct reference wave $R$, which is depleted because of diffraction into $S$ and absorption. The quantities of interest are the amplitudes $R(d)$ which can be obtained from the analysis of Section 2.2. We will give here the general results for transmission and reflection holograms. The notation is that of Section 2.

Fig. 17. Reflection holograms: the angular and wavelength sensitivity of an absorption grating for $\alpha_1 = \alpha$ ($D_1 = D_0$) and values of $D_1 = 2\nu = D_0 =$ ($\eta_0 = 0.007$), $D_1 = 1$ ($\eta_0 = 0.05$), and $D_1 = 2$ ($\eta_0 = 0.068$).

Fig. 18. Reflection holograms: the efficiency of an absorption grating as a function of slant for $\alpha_1 = \alpha$ ($D_1 = D_0$). $c = c_R/c_S$ is the slant factor.


5.1 Transmission Holograms 


From equations (27) and (33) we get for the constants $r_i$ of equation
(25) the expressions


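$$
\begin{aligned}
r_1 &= -\frac{\kappa^2}{c_S(\gamma_1 - \gamma_2)(c_R\gamma_1 + \alpha)} \\
r_2 &= \frac{\kappa^2}{c_S(\gamma_1 - \gamma_2)(c_R\gamma_2 + \alpha)}.
\end{aligned}
\tag{71}
$$

Using this we can write the output amplitude $R(d)$ of the reference wave in the form

$$R(d) = \frac{\kappa^2}{c_S(\gamma_1 - \gamma_2)}\left(\frac{\exp(\gamma_2 d)}{c_R\gamma_2 + \alpha} - \frac{\exp(\gamma_1 d)}{c_R\gamma_1 + \alpha}\right). \tag{72}$$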


5.2 Reflection Holograms 


For reflection holograms we use equations (27), (37), and (39) to
express the constants $r_i$ in the form


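$$
\begin{aligned}
r_1 &= \frac{(c_S\gamma_1 + \alpha + j\vartheta)\exp(\gamma_2 d)}{\exp(\gamma_2 d)(\alpha + j\vartheta + c_S\gamma_1) - \exp(\gamma_1 d)(\alpha + j\vartheta + c_S\gamma_2)} \\
r_2 &= -\frac{(c_S\gamma_2 + \alpha + j\vartheta)\exp(\gamma_1 d)}{\exp(\gamma_2 d)(\alpha + j\vartheta + c_S\gamma_1) - \exp(\gamma_1 d)(\alpha + j\vartheta + c_S\gamma_2)}.
\end{aligned}
\tag{73}
$$

The output amplitude $R(d)$ of the reference wave becomes

$$R(d) = \frac{c_S(\gamma_1 - \gamma_2)}{(\alpha + j\vartheta + c_S\gamma_1)\exp(-\gamma_1 d) - (\alpha + j\vartheta + c_S\gamma_2)\exp(-\gamma_2 d)}. \tag{74}$$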


More detailed evaluations of the above formulas should follow the 
pattern prescribed in Sections III and IV. They can be undertaken 
for the specific case when the need arises. 


VI. VALIDITY OF THE THEORY 


We have tried to make our results as generally applicable as possible. 
We have allowed for the presence of absorption in the various hologram 
gratings and for a slant of the fringe planes. But a whole range of assump- 
tions had to be made to make the simple coupled wave analysis possible. 
It seems appropriate to recount these assumptions to make clear the 
region of validity of the coupled wave theory. We have assumed that: 


(i) The electric field of the light is polarized perpendicular to the plane of incidence. However, the appendix gives an extension of the theory to allow also for light of parallel polarization.

(ii) A slant of the fringe planes with respect to the z-axis is allowed, except that these planes are perpendicular to the plane of incidence. (This is reflected in the assumption $\epsilon(x, z)$, $\sigma(x, z)$.) But this assumption is not made in the generalization which we have given in the appendix.

(iii) The spatial modulation of the refractive index and the absorption constant is sinusoidal.

(iv) There is a small absorption loss per wavelength and a slow energy interchange (per wavelength) between the two coupled waves. This condition is stated mathematically in equation (7) and justifies neglecting the second derivatives $R''$ and $S''$ in the analysis.

(v) There is the same average refractive index n for the regions 
inside and outside the grating boundaries. If the grating has interfaces 
with air, then Snell’s law has to be used to correct for the angular 
changes resulting from refraction. 

(vi) Light incidence is at or near the Bragg angle and only the diffrac- 
tion orders which obey the Bragg condition at least approximately are 
retained in the analysis. The other diffraction orders are neglected. 


A detailed mathematical justification of assumption (vi) is outside the scope of our simple analysis. One can advance physical arguments to show that this step limits the validity of the theory to "thick" gratings, where the phase synchronism between the two coupled waves has enough time to develop a strong and dominating effect. Better definitions of a "thick grating" must come from more accurate theories which are available for special cases. A large amount of work has been done on acoustic diffraction gratings which correspond to the case of our unslanted, lossless, dielectric transmission-hologram gratings. In acoustic diffraction one defines the parameter

$$Q = 2\pi\lambda d/n\Lambda^2 \tag{75}$$

as an appropriate measure of grating thickness. We can regard a grating as thick when the condition $Q > 1$ holds. It appears that the coupled wave theory begins to give good results for values of $Q = 10$. This is particularly well demonstrated by Klein and his associates in theoretical and experimental work on acoustic gratings for the predictions of both the peak efficiencies and the angular sensitivities. We hasten to add that for the majority of practical holograms the parameter $Q$ is larger, and sometimes much larger, than 10.
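As a rough illustration of the thickness criterion, the parameter of equation (75) can be computed for a sample grating. The numbers below (a 15 µm layer, a 0.5 µm fringe period, red light, an average index of 1.5) are hypothetical values chosen for the example and are not taken from the paper; the function name is likewise an assumption.

```python
import math


def thickness_parameter(wavelength: float, thickness: float,
                        index: float, period: float) -> float:
    """Equation (75): Q = 2*pi*lambda*d/(n*Lambda^2), all lengths in the same unit."""
    return 2.0 * math.pi * wavelength * thickness / (index * period ** 2)


# Hypothetical example values, lengths in micrometers.
Q = thickness_parameter(wavelength=0.633, thickness=15.0, index=1.5, period=0.5)
print(f"Q = {Q:.0f}")
```

For these sample values $Q$ is well above 10. Because $Q$ scales with $d/\Lambda^2$, thinner layers or coarser fringes bring a grating toward the regime where the two-wave description becomes doubtful.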

Further checks of the validity of the coupled wave theory are provided by comparisons with accurate computer calculations and with experiments on special examples of gratings. Burckhardt has made computer calculations on unslanted, lossless, dielectric transmission holograms for selected values of grating parameters which are commonly encountered in holography. Comparison with the results of the coupled wave theory shows very satisfactory agreement. Measurements by Shankoff and Lin on dielectric transmission holograms prepared with dichromated gelatin yielded diffraction efficiencies approaching 100 percent, which agrees with the theory (even though there may be some uncertainty as to the exact nature of the refractive index variations).

Efficiency measurements on thick absorption gratings for the case of transmission holograms were made by George, Mathews, and Latta. Efficiencies approaching our predicted maximum value of 3.7 percent were observed.

Kiemle has studied unslanted ($\phi = 0$) reflection holograms for the special case of normal incidence ($\theta_0 = 0$) by analyzing equivalent four-terminal networks. His treatment of absorption gratings corresponds to the material we discussed in Section 4.3 specialized to the case of $\theta_0 = 0$. But Kiemle's value of 2.8 percent for the maximum diffraction efficiency of absorptive reflection holograms does not agree with our prediction of 7.2 percent. This disagreement appears to derive from a set of restrictive assumptions made in Kiemle's work. Experimental observations on absorptive reflection holograms were made by Lin and Lo Bianco. Efficiency values as high as 3.8 percent were measured, which seems to support the predictions of the coupled wave theory. But further experiments are needed for a good confirmation.


VII. CONCLUSIONS 


We have discussed a coupled-wave analysis of the Bragg diffraction 
of light by thick hologram gratings. This approach made it possible to 
derive simple algebraic formulas for the behavior of various types of 
holograms, even for the case of high diffraction efficiencies where the 
incident wave is strongly depleted. The treatment covers transmission 
holograms and reflection holograms, and it includes the spatial modula- 
tions of both the refractive index and the absorption constant. The 
influence of loss in the grating and of slanted fringes is also discussed. 
Formulas and their numerical evaluations are given for the diffraction 
efficiencies and the angular and wavelength sensitivities of various 
grating types. 

For special cases we can compare the results of this theory with more accurate computations and with experimental observations. These comparisons give us the confidence to assume that the coupled wave predictions are good for a broad range of practical hologram types.


APPENDIX


Reduced Coupling for Light Polarized in the Plane of Incidence 


In the body of this paper it was assumed that the incident light is 
polarized perpendicular to the plane of incidence. The purpose of this 
appendix is to show that we can use the results of the main paper also 
when the light is polarized in the plane of incidence, provided that we
modify the coupling constant $\kappa$. Such a modification is suggested already
by the dynamical theory of X-ray diffraction.

As in Section II we start with the wave equation

$$\nabla^2\mathbf{E} - \nabla(\nabla\cdot\mathbf{E}) + k^2\mathbf{E} = 0 \tag{76}$$

for the electric field in the grating. Here, in contrast to equation (1), we have described the field by the vector quantity $\mathbf{E}$ and have included the term $\nabla(\nabla\cdot\mathbf{E})$, which is not necessarily zero. The constant $k^2$ is defined in equation (4). As in the main paper, we assume that only two waves are present in the grating, and put

$$\mathbf{E} = \mathbf{R}(z)\,e^{-j\boldsymbol{\rho}\cdot\mathbf{x}} + \mathbf{S}(z)\,e^{-j\boldsymbol{\sigma}\cdot\mathbf{x}} \tag{77}$$

using the vectors $\mathbf{R}$ and $\mathbf{S}$ to describe the amplitudes of the reference and signal waves. $\boldsymbol{\rho}$ and $\boldsymbol{\sigma}$ are the propagation vectors (as in Section II) which point in the direction of the wave normals. They are related by equation (11). In addition we assume that both $\mathbf{R}$ and $\mathbf{S}$ are transverse waves, that is, that the following conditions hold

$$(\boldsymbol{\rho}\cdot\mathbf{R}) = 0, \qquad (\boldsymbol{\sigma}\cdot\mathbf{S}) = 0. \tag{78}$$

Combining equations (76), (77), and (78) we get, after separating terms with equal exponentials and neglecting second derivatives $\partial^2/\partial z^2$,

$$-2j\rho_z\mathbf{R}' + j\boldsymbol{\rho}R_z' - 2j\alpha\beta\mathbf{R} + 2\kappa\beta\mathbf{S} = 0 \tag{79}$$

$$-2j\sigma_z\mathbf{S}' + j\boldsymbol{\sigma}S_z' + (\beta^2 - \sigma^2 - 2j\alpha\beta)\mathbf{S} + 2\kappa\beta\mathbf{R} = 0 \tag{80}$$


where $R_z$ and $S_z$ are the z-components of $\mathbf{R}$ and $\mathbf{S}$, and the notation of Section II is used.

We now make the additional assumption that the polarizations of $\mathbf{R}$ and $\mathbf{S}$ do not change in the grating and write

$$
\begin{aligned}
\mathbf{R}(z) &= R(z)\,\mathbf{r}, \\
\mathbf{S}(z) &= S(z)\,\mathbf{s},
\end{aligned}
\tag{81}
$$

where $R(z)$ and $S(z)$ are the scalar amplitudes of the two waves, and $\mathbf{r}$ and $\mathbf{s}$ are polarization vectors independent of $z$. These vectors are normalized so that

$$(\mathbf{r}\cdot\mathbf{r}) = 1, \qquad (\mathbf{s}\cdot\mathbf{s}) = 1. \tag{82}$$

Because of (78) we have

$$(\mathbf{r}\cdot\boldsymbol{\rho}) = 0, \qquad (\mathbf{s}\cdot\boldsymbol{\sigma}) = 0. \tag{83}$$

After forming the dot products of $\mathbf{r}$ with equation (79) and of $\mathbf{s}$ with equation (80) we use equations (81), (82), and (83) to arrive at

$$-2j\rho_z R' - 2j\alpha\beta R + 2\kappa\beta S\,(\mathbf{r}\cdot\mathbf{s}) = 0 \tag{84}$$

$$-2j\sigma_z S' + (\beta^2 - \sigma^2 - 2j\alpha\beta)S + 2\kappa\beta R\,(\mathbf{r}\cdot\mathbf{s}) = 0. \tag{85}$$

As in Section II, we introduce the abbreviations

$$c_R = \rho_z/\beta, \qquad c_S = \sigma_z/\beta, \tag{86}$$

and

$$\vartheta = (\beta^2 - \sigma^2)/2\beta, \tag{87}$$

which allow us to write the above equations in the form

$$c_R R' + \alpha R = -j\kappa(\mathbf{r}\cdot\mathbf{s})S \tag{88}$$

$$c_S S' + (\alpha + j\vartheta)S = -j\kappa(\mathbf{r}\cdot\mathbf{s})R. \tag{89}$$

These are coupled wave equations which govern the Bragg diffraction of light polarized parallel to the plane of incidence, and indeed, of light of arbitrary polarization. They are similar in form to the coupled wave equations (21) and (22) which were derived for perpendicular polarization. The only difference is a reduction of the effective coupling constant by the dot product $(\mathbf{r}\cdot\mathbf{s})$ of the two polarization vectors.

Referring to the grating geometry of Fig. 1 we have $(\mathbf{r}\cdot\mathbf{s}) = 1$ for light polarized perpendicular to the plane of incidence. For parallel polarization the value of this dot-product depends on the inclination angles, and we have a reduced effective coupling constant $\kappa_\parallel$ given by

$$\kappa_\parallel = \kappa(\mathbf{r}\cdot\mathbf{s}) = -\kappa\cos 2(\theta_0 - \phi). \tag{90}$$
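Equation (90) can be tabulated to see how strongly parallel polarization reduces the coupling. The sketch below is illustrative, with a hypothetical function name; it evaluates the factor $\kappa_\parallel/\kappa$ for an unslanted transmission grating ($\phi = \pi/2$, as in Section III) at a few sample Bragg angles.

```python
import math


def parallel_coupling_factor(theta0: float, phi: float) -> float:
    """(r.s) from equation (90): kappa_parallel/kappa = -cos 2(theta0 - phi)."""
    return -math.cos(2.0 * (theta0 - phi))


phi = math.pi / 2.0  # unslanted transmission grating
for deg in (10.0, 30.0, 45.0, 60.0):
    factor = parallel_coupling_factor(math.radians(deg), phi)
    print(f"theta0 = {deg:4.1f} deg: kappa_par/kappa = {factor:+.3f}")
```

For this geometry the factor equals $\cos 2\theta_0$: it is close to 1 for small Bragg angles, vanishes at 45 degrees, and changes sign beyond that.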


We can apply the results of the main paper for parallel polarization if we replace $\kappa$ by $\kappa_\parallel$. For this polarization there is the trivial case of a Bragg angle of 45° (that is, diffraction angles of 90°) where $(\mathbf{r}\cdot\mathbf{s}) = 0$ and the intensity of the diffracted light goes to zero.


Statistics on Attenuation of Microwaves
by Intense Rain


By D. C. HOGG 


(Manuscript received June 12, 1969) 


Heavy rainfall and associated attenuation at centimeter and millimeter wavelengths are discussed. Measured attenuations are combined with path-rainfall statistics obtained from a rain-gauge network to produce plots of attenuation versus path length for a given probability of fading. Under the assumption that the spatial behavior of heavy rain is similar at various locations, the path-average rainfall statistics are combined with highly resolved point rain rates for geographically separated places to produce attenuation data appropriate to those places. Dual parallel-path diversity is also evaluated; it is shown to be a very advantageous arrangement.


I. INTRODUCTION 


An important problem in designing wide-band radio-relay systems at frequencies exceeding 10 GHz is reliability. Propagation through heavy rain is the significant factor in determining reliability of the medium. Thus it is important to examine the spatial and temporal behavior of heavy rain and the resultant attenuation.

Recent measurements of propagation at 18.5 GHz and 30.9 GHz, and analysis of rainfall data from the Crawford Hill rain-gauge network of Bell Telephone Laboratories at Holmdel, New Jersey, have led to an improved understanding of the rain environment. Those data are used here to provide information on attenuation by rain for use in system design. In particular, the improvement in performance obtained by use of path diversity is evaluated.


II. SINGLE-PATH STATISTICS (NEW JERSEY)


2.1 The Magnitude of the Attenuation 


First, one must ask: What is the magnitude of the attenuation caused by heavy rain at frequencies exceeding 10 GHz? Figure 1 is a plot of attenuation measured at a rain rate of 100 mm/hr (4 inches per hour) for a path length of 1 km. Data measured at 100 mm/hr, rather than at low rain rates, are used because path-average rain rates of this magnitude do indeed occur a significant percentage of the time in many places, including New Jersey.

Fig. 1. Attenuation measured during rain of rate 100 mm/hr (averaged over a 1 km path). The measurements at 8 and 15 GHz are from Ref. 6; 11 GHz from Ref. 7; 18 and 30 GHz from Refs. 1 and 2; and 50 and 70 GHz from Ref. 9. △ indicates Bell Telephone Laboratories data and ○ indicates DRB 1966 data, Canada (some extrapolation for both).

Moreover, in the discussion that follows, we are concerned with attenuations caused by path-average rain rates of the order of 100 mm/hr, and the attenuations will be taken to be directly proportional to the path-average rain rates; that is, proportional to the average density of rain along the path.* The curve in Fig. 1 serves as a benchmark by means of which attenuation is related to heavy path-average rain rates.

* From theoretical considerations, the attenuation $\gamma$ at frequencies of the order 10 GHz is believed related to the rain rate $R$ by $\gamma = \alpha R^{\beta}$. $\alpha$ is a function of frequency as indicated and $\beta$ is also a mild function of frequency with values near unity. Here we use values of $\alpha$ measured at high rain rates (since $\beta$ is taken to be unity) to minimize errors in the event $\beta$ departs from unity.

Thus for heavy rains one has:

$$\gamma = 0.04Rd; \qquad \gamma = 0.1Rd; \qquad \gamma = 0.2Rd$$

for frequencies of 11, 18, and 30 GHz, $\gamma$ being the attenuation in decibels and $R$ the average rain rate on a path of length $d$.
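The three proportionality relations above amount to a small lookup table. The sketch below is a minimal illustration; it assumes that $R$ is expressed in mm/hr and $d$ in km, the units used for the benchmark plot of Fig. 1, and the function and table names are hypothetical.

```python
# Proportionality constants from the three relations quoted above,
# assuming R in mm/hr and d in km (the units of the Fig. 1 benchmark).
DB_PER_KM_PER_MM_HR = {11: 0.04, 18: 0.1, 30: 0.2}


def rain_attenuation_db(freq_ghz: int, rain_rate_mm_hr: float, path_km: float) -> float:
    """Attenuation gamma = k*R*d for heavy rain at 11, 18 or 30 GHz."""
    return DB_PER_KM_PER_MM_HR[freq_ghz] * rain_rate_mm_hr * path_km


for freq in (11, 18, 30):
    gamma = rain_attenuation_db(freq, 100.0, 1.0)
    print(f"{freq} GHz: {gamma:.1f} dB at 100 mm/hr over 1 km")
```

Because the relations are linear in both $R$ and $d$, the same function gives the attenuation for any path-average rain rate and path length within the heavy-rain regime to which the constants apply.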


2.2 Dependence of Path-Average Rain Rate on Path Length 


The path analysis of rain rate discussed in Ref. 3 encompasses the 
heavy rains of 1967 taken on 100 rain gauges forming a 130(km)’ grid 
in New Jersey. Obviously, there are many paths of various lengths in 
such a network and a relatively large amount of data is obtained for 
such paths from the several storms that occur during one year. Path 
average rain rates have been converted to a yearly base and are plotted 
in Fig. 2;* the curves show the probability of path-average rain rate 
with path length as parameter. At rain rates of the order 50 mm/hr, the 
probabilities are about the same for all path lengths, namely, about 
0.01 percent; thus the probability of exceeding an average rate of 50 
mm/hr on a 10.4 km path is about the same as at a point (path length
zero in Fig. 2). As the rain rate increases, the curves diverge. For example,
the probability of a 100 mm/hr rain rate on a 10.4 km path is less
by a factor of ten than that for a point; at 150 mm/hr, the factor is 
one hundred. 


The data in Fig. 2 can be examined in another way. Consider a given 
probability, say, 0.001 percent (five minutes per year); the correspond- 
ing rain rate at a point is about 160 mm/hr, whereas for a 10.4 km path 
it is 80 mm/hr. This behavior tells us that heavy rains occur as localized 
showers. Of course, this behavior will show up in evaluating the at- 
tenuation on paths of various lengths. 
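
The correspondence between a percentage of time and minutes per year used above is simple arithmetic. The sketch below (the function name is an assumption made for illustration) shows that 0.01 and 0.001 percent of a 365-day year are roughly 53 and 5.3 minutes, which the text rounds to 50 and 5 min/yr.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def percent_to_minutes_per_year(percent):
    """Convert a percentage of time into minutes per (non-leap) year."""
    return percent / 100.0 * MINUTES_PER_YEAR


if __name__ == "__main__":
    for percent in (0.01, 0.001):
        minutes = percent_to_minutes_per_year(percent)
        print(percent, "percent is about", round(minutes, 1), "min/yr")
```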


2.3 Dependence of Attenuation on Path Length 


The relationship between attenuation and path-average rain rate 
at various frequencies as given in Fig. 1, and the probability of occur- 
rence of rain rates, Fig. 2, have been used to produce Figs. 3a and b. 
Two probability levels (0.01 percent, 50 min/yr, and 0.001 percent, 
5 min/yr) and three frequencies (11, 18, and 30 GHz) have been chosen 
as representative of radio relay. These plots give computed attenuation 
that is exceeded for the percent of time indicated on the figure as a 
function of path length. Note that there is curvature in the plots. As 
one would expect, having looked at Fig. 2, the attenuation one obtains 


* From curves A in Fig. 28 of Ref. 3. 
Fig. 2— Probability of path-average rain rates for paths of various lengths; 1967 
rain-gauge network data. 


(for a low probability) on a 10 km path is less than one would expect by 
linearly extrapolating from the attenuation on a 1 km path. 

The two propagation paths in operation within the Holmdel rain- 
gauge network are 1.9 and 6.4 km long at frequencies 30.9 and 18.5 
GHz, respectively;’’? these lengths are indicated by arrows on the 
abscissas in Fig. 3.* Percent of time distributions of attenuation on 
these paths were measured throughout 1967 and 1968 and points taken 
at the indicated probability level are shown on the figures. For the 1.9 
km path, the measured 30.9 GHz attenuations agree well with the 
computed curves for 30 GHz: somewhat higher in Fig. 3a and slightly 
lower in Fig. 3b. 

Likewise, in Fig. 3a the points measured at 18.5 GHz (6.4 km) are 
in good agreement with, but are somewhat lower than, the computed 
curve for 18 GHz. In Fig. 3b the 18.5 GHz measurement for 1968 is 
somewhat below the computed curve; however, the 1967 measurement 


* The 18.5 GHz signal is vertically polarized and the 30.9 GHz signal is polarized 
45° from vertical. 
Fig. 3— Attenuation as a function of path length at 11, 18 and 30 GHz (1967 
network data) for (a) 0.01 percent probability (50 min/yr) and (b) 0.001 percent 
probability (5 min/yr) along with measurements on paths of length 1.9 km (30.9 
GHz) and 6.4 km (18.5 GHz). 
is considerably lower. Comparison of the 18.5 GHz attenuation distributions
for 1967 and 1968 shows that heavy showers were more frequent
on this path in 1968 than in 1967. The 18 GHz curves in Figs.
3a and b are apparently somewhat conservative. 

Thus use of the attenuations in Fig. 1 to convert the pool of path- 
average rain rates from the rain-gauge network has led to a set of curves 
of attenuation versus path length that are consistent with independent 
measurements of attenuation. Accordingly, for design of a conventional 
tandem relay system at, say, 18 GHz, with a 30 dB margin, the repeater 
spacing is, from Fig. 3b, 2.5 km for 0.001 percent probability on
individual paths in coastal New Jersey. 


III. SINGLE-PATH STATISTICS (OTHER LOCATIONS) 


It is tempting to ask if the knowledge gained from the above studies 
can be used to say something about the attenuation environment in 
places other than coastal New Jersey. If certain assumptions are made 
concerning the spatial distribution in rain showers, that can be done. 


3.1 Point Rainfall Rates of High Resolution 


Distributions of point rain rates with high resolution have been measured
in a few places, as shown in Fig. 4. Four of the full curves were measured
in the United States by the Illinois State Water Survey using a
photographic method over the best part of a year; they form
a consistent set of data.° This method is capable of measuring drops in 
a small volume during a short interval every ten seconds. The solid line 
for Bedford, England is from a four-year sample;’ gauges with two- 
minute resolution were used. The dashed curve is the distribution for the 
pool of data taken during 1967 on the rain-gauge network at Holmdel, 
New Jersey;> gauges with a time constant less than one second were 
sampled every ten seconds. 

For a given probability of occurrence, how much heavier does it 
rain at other locations than in New Jersey? Table I shows the point 
rain-rate intensity in other places relative to New Jersey for the 0.01
and 0.001 percent levels; the Illinois State Water Survey set of curves and the
data from England in Fig. 4 are used in this comparison. 

Thus in the regime of low probability (high rain rate), the rain inten- 
sity in New Jersey is about one quarter that of Miami, Florida, and 
five times that of Corvallis, Oregon. These data must now be linked with 
the spatial distributions obtained in New Jersey in order to determine 
the attenuations. 
Fig. 4— Point rainfall rates measured in several places by instruments with
rapid response. 


3.2 The Spatial Distribution of Rain Showers 


The data in Fig. 2 show that the probability of a given path-average 
rain rate decreases with increasing path length for heavy rains, a not too 
surprising result since one is dealing with rain cells of limited size. Like- 
wise, for a given probability level, the path-average rain rate decreases 
with increasing path length as shown in Fig. 5. For relatively high probability
($10^{-4}$), this decrease does not amount to much; as shown by the
lowest curve in Fig. 5, the average rain rate for a 10 km path is about the 
same as that for a point (d = 0). However, for example, on the upper- 


TABLE I—RELATIVE INTENSITY OF POINT RAIN RATES

                            Miami     Coweeta          Island Beach   Bedford   Corvallis
Probability Level           Florida   North Carolina   New Jersey     England   Oregon
(a) $10^{-4}$ (50 min/yr)   5         1.75             1              0.48      0.25
(b) $10^{-5}$ (5 min/yr)    3.5       1.55             1              0.42      0.15
Average of (a) & (b)        4.2       1.65             1              0.45      0.2
Fig. 5— Average path rain rate in New Jersey versus path length for probability 
levels 10-4, 10-5, and 1078. 


most curve in Fig. 2 (for $10^{-5}$ probability) the average rain rate on a 
10 km path is only one half the rate at a point. 

Assume that the spatial behavior of heavy rainfall is the same in 
other places as it is in New Jersey. This means that in a place with 
relatively low point rain rates (such as Oregon, Fig. 4), path-average 
rates are about the same as point rates (such as in the lowest curve in 
Fig. 5), that is, large-area rain. By contrast, where the point rain rates 
are very high (such as in Florida, Fig. 4), the path-average rates are 
much less than the point rates (such as in the uppermost curve in Fig. 
5), that is, showers. To determine whether this assumption is warranted, 
one must await spatial measurements of rain rate in other places. 

The data in Figs. 2 and 4 are used to construct Table II, a list of path- 
average rain rates for the various locations, as a function of path length, 
d. Some extrapolation of the curves in Fig. 2 was necessary to obtain the 
column for Miami, Florida. 

Table II has been converted to attenuation at 18 GHz by way of the
relationship discussed above; the result is shown in Fig. 6. As one would expect, 
the attenuation for Oregon is linear with path length, whereas for places 
with heavy rain, there is considerable curvature. Figure 6 tells us that 
a single transmission path at 18 GHz with a 30 dB fading margin should 
not exceed 1, 2, 3, 6, and 15 km in Florida, North Carolina, New Jersey, 
TABLE II—PATH-AVERAGE RAIN RATES IN MM/HR FOR THE $10^{-5}$ PROBABILITY LEVEL

          Corvallis   Bedford   Island Beach   Coweeta          Miami
d (km)    Oregon      England   New Jersey     North Carolina   Florida
0         20          55        130            200              450
1.3       20          53        110            165              325
2.6       20          52        103            153              265
5.2       20          50        90             135              215
7.8       20          48        75             110              210
10.4      20          45        70             95

Bedfordshire, England,* and Oregon, respectively, if a probability of
$10^{-5}$ is stipulated. 
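
The path lengths quoted above can be checked roughly from Table II. The sketch below converts each tabulated rain rate to 18 GHz attenuation with y = 0.1Rd and finds where the attenuation first reaches 30 dB. The linear interpolation between table rows, the constant rain rate assumed beyond the last row, and the names `max_path_km`, `PATH_KM`, and `RAIN_MM_HR` are assumptions made for this illustration; the values in the text were presumably read from Fig. 6.

```python
# Table II: path-average rain rates (mm/hr) at the 1e-5 probability level.
PATH_KM = [0.0, 1.3, 2.6, 5.2, 7.8, 10.4]
RAIN_MM_HR = {
    "Oregon": [20, 20, 20, 20, 20, 20],
    "England": [55, 53, 52, 50, 48, 45],
    "New Jersey": [130, 110, 103, 90, 75, 70],
    "North Carolina": [200, 165, 153, 135, 110, 95],
    "Florida": [450, 325, 265, 215, 210],  # the table has no 10.4 km entry
}
K_18GHZ = 0.1  # dB per (mm/hr * km), heavy-rain coefficient quoted for 18 GHz


def max_path_km(rates, margin_db=30.0):
    """Path length at which y = K * R(d) * d first reaches margin_db.

    Interpolates linearly between table rows. Beyond the last row the final
    rain rate is held constant (an assumption made only for this sketch).
    """
    prev_d, prev_y = 0.0, 0.0
    for d, rate in zip(PATH_KM, rates):
        y = K_18GHZ * rate * d
        if y >= margin_db:
            return prev_d + (margin_db - prev_y) * (d - prev_d) / (y - prev_y)
        prev_d, prev_y = d, y
    return margin_db / (K_18GHZ * rates[-1])


if __name__ == "__main__":
    for place, rates in RAIN_MM_HR.items():
        print(place, round(max_path_km(rates), 1), "km")
```

Rounded to the nearest kilometer, this simple interpolation gives 1, 2, 3, 6, and 15 km for Florida, North Carolina, New Jersey, England, and Oregon.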

One might argue that in Florida (for example) where the water vapor 
available for production of rain exceeds that of New Jersey, the dimen- 
sion of a rain cell of given rain rate may exceed that of a cell of the same 
rain rate in New Jersey. If this were true, the attenuation for Florida 
and North Carolina in Fig. 6 would be somewhat higher than shown. 


IV. PATH DIVERSITY (NEW JERSEY) 


The analysis of the rain-gauge network data by Freeny and Gabbe’ 
encompasses not only single paths of various lengths but also joint 
statistics for pairs of parallel paths separated by various distances.† 
These data are applicable to the design of path-diversity systems in that 
they are statistics of the percentage of time that the average rain rate 
on both paths exceeds given values. Of course, the idea in path diversity 
is to switch to the path with lowest attenuation.’ 


4.1 Two Parallel Paths with a Given Separation 


An example of how path-average rain rates in the diversity arrangement
convert to attenuation is given in Figs. 7a, b, and c for frequencies
of 11, 18, and 30 GHz. The curves apply to the 0.001 percent probability 
level (5 min/yr) and a diversity separation of 5.2 km (3.25 miles). For 
comparison, the attenuation for a single path is shown by a dashed line 


*In a recent Committee Consultatif Internationale Radio document (United 
Kingdom Document IX/164-E, May 9, 1969), attenuation distributions for a 24 km 
path, and for the worst year (1968) observed to date, indicate that the path length 
appropriate to 0.001 percent probability and 30 dB attenuation is something less 
than 12 km in Bedfordshire at 18 GHz. This presumably means that, even in a 
relatively low rain-rate environment, the heavier rains do indeed occur as showers 
of limited size (see also Ref. 5). That being the case, the curve for England in Fig. 6 
would have more curvature than indicated, that is, the curve in Fig. 6 would be 
quite conservative. 

† See Fig. 28 of Ref. 3. 
Fig. 6—18 GHz attenuation by rain (in decibels) versus path length for various places; probability
level $10^{-5}$ (5 min/yr). 


on each figure. If transmission paths at 18 GHz with a 30 dB fading 
margin are considered, Fig. 7b shows that the path length in the diversity 
arrangement with 5.2 km separation can be just over 5 km, compared 
with 2.5 km when no diversity is used. 


4.2 Relationship Between Interrepeater Path Length and Diversity Separation

A somewhat more general question is: For a given attenuation margin 
and a given probability level, how does the inter-repeater path length 
change with diversity separation? As an example, 18 GHz, 30 dB, and 
0.001 percent are chosen for the frequency, margin, and probability 
level; the data are plotted in Fig. 8. Note that the path length d for a 
diversity separation s of 7.5 km is 7 km, about thrice the path length 
(2.5 km) for the nondiversity arrangement (s = 0). Results such as 
these have considerable economic implications. The data can also be 
plotted as in Fig. 9 where 18 GHz attenuation is given as a function of 
path length with path separation a parameter. Apparently, for a given 
Fig. 7— Attenuation appropriate to a dual parallel-path-diversity separation of 
5.2 km as a function of path length for frequencies (a) 11 GHz, (b) 18 GHz, and 
(c) 30 GHz. The dashed curves are for conventional (nondiversity) paths. 
(Attenuation exceeded 0.001 percent of the time jointly for two parallel paths 
spaced 5.2 km apart.) 
Fig. 8— Path length as a function of dual parallel-path-diversity separation for 
a 0.001 percent probability level (5 min/yr) at 18 GHz with a 30 dB attenuation 
margin; 1967 network data. 
Fig. 9—18 GHz attenuation appropriate to 0.001 percent probability versus 
path length for various diversity separations; 1967 network data. 
diversity separation, the advantage of diversity over nondiversity is not 
a strong function of the fading margin. 

As yet we have no actual attenuation measurements on path diversity. 
However, the data of Figs. 8 and 9 are believed conservative in the same 
sense as those of Fig. 3; of course, they apply only to coastal New Jersey. 


V. DISCUSSION 


Although the data given here result in well-resolved design curves 
hopefully useful in design of radio systems, at least two important 
questions remain. A system is comprised of many paths in tandem 
forming a route of length J, whereas here only single paths have been 
discussed. If one has n such paths in tandem, is the probability $P_s$ of
attenuation by rain on the system simply $nP_1$, where $P_1$ is the probability
for a single path? In other words, is there no correlation between
heavy fades on tandem paths? Obviously, if a dense rain cell were centered
on a repeater, there would be correlation of attenuation on the
two paths associated with that repeater. From such considerations and
examination of rainfall data, the relationship $P_s = nP_1$ is believed too
conservative. 

The other question is related to path diversity. We have only dis- 
cussed the case of two (single) parallel paths separated by various 
distances. But in an actual system one deals with several paths in tandem 
on each leg of the route; these two legs must of course merge if one wishes 
to switch from one to the other. The path lengths for merge points lie 
between those given in Fig. 3 and those appropriate to a parallel path 
diversity arrangement.’ Moreover, the diversity analysis here deals 
with two (single) parallel paths of given separation whereas in practice 
one would be dealing with a line of tandem paths parallel to, and dis- 
placed from, a second such set. In that case, the advantage gained by 
path diversity must be investigated beyond what we have done here. 

Finally, it should be pointed out that the microwave systems to 
which the above discussion is pertinent would carry very wide bands 
of information. Clearly, the advantages of dual paths in providing 
equipment diversity (in addition to propagation reliability) would be 
considerable in such systems, especially from the viewpoint of main- 
tenance. 
Work-Scheduling Algorithms: A 
Nonprobabilistic Queuing Study 


(with Possible Application 
to No. 1 ESS) 


By JOSEPH B. KRUSKAL 
(Manuscript received November 7, 1968) 


In many large computer systems with real-time use (such as the No. 1 
Electronic Switching System), the central processing unit handles much 
of its work through queues. It may spend much of its time cycling through 
the queues, performing the work requests it finds there. To accommodate 
varying degrees of urgency, the cycle may visit some hoppers more often 
than others. (No. 1 ESS strongly relies on this procedure.) This paper 
provides an approximate method for evaluating different cycles. 

Using the evaluation method and some approximations, we obtain a 
formula for the optimum relative frequency with which different queues 
should be visited. 

The model used is nonprobabilistic, and treats requests as continuous 
rather than discrete. The model also ignores certain interdependencies 
between queues. Despite these drastic simplifications, the results probably 
provide useful guidance, if interpreted cautiously. 


I. INTRODUCTION 


In many large computer systems, especially those with real-time 
use, the central processor handles much of its work through queues, 
which contain work requests. (The queues may also be called hoppers, 
buffers, waiting lines, files, and so forth. In this paper we call them 
hoppers.) The processor examines each hopper in turn, and performs 
some or all of the work requests, if any, which it finds there. 

Some work requests require processing more urgently than others. 
One method of providing appropriate response times is to examine 
more frequently hoppers which contain urgent work, and other hoppers 
less frequently. For example, the No. 1 ESS (Electronic Switching 
System) has many hoppers which it groups into five different urgency 
classes.*'” The five classes are examined (or ‘‘visited’’) in a fixed re- 
curring cycle, of length 30, during which the classes are visited 15, 
8, 4, 2, and 1 times, respectively. (During a single visit to a single class, 
the individual hoppers are visited once each, in a fixed sequence.) 

This paper contains a practical approximate model for evaluating 
various alternative cycles. The conceptual basis for the evaluation 
is the expected time each work request must wait in the hopper before 
being serviced by the central processor. (Such times depend not only 
on the cycle, but also on the times required to process requests, and 
on the rates at which new requests are initiated. These are all assumed 
given.) The expected waiting times for different hoppers are multiplied
by frequencies and also by weights $w_i$, the “average penalty
per second of delay,” and added. The resulting sum is called P, the
“expected total penalty per second.” The weights $w_i$, which reflect 
the relative importance of delaying different work requests, are assumed 
given, and we seek to minimize P by choosing the cycle wisely. By 
way of illustration, the calculations required to evaluate any given 
cycle are given for two very simple cycles. 

When applied to general cycles, our model yields the plausible 
conclusion that visits to the same hopper should be spaced as evenly 
as possible around the cycle (in terms of elapsed time between visits). 
Furthermore, the model permits us to estimate how sensitive P is to 
deviations from this ideal. 

Our most important conclusion is an explicit formula for how fre- 
quently each hopper should be visited. To obtain this formula, we 
assume that visits to each hopper are evenly spaced around the cycle. 
Then P becomes a function of the visit frequency (and not of detailed 
visit pattern). We explicitly optimize this function, to obtain a formula 
for visit frequencies. 

The time required to examine a hopper, whether or not it contains 
any work requests, is small but highly significant, and is an important 
consideration in the problem. Our model explicitly reflects this fact. 
(Indeed, it is known though sometimes overlooked that the No. 1 ESS 
central processor finds most hoppers empty on a majority of its visits, 
even when it is heavily loaded with work and operating near its ca- 
pacity limits. This can occur because the number of hoppers is so large, 
and because each work request requires a relatively long time to service 
compared with the time to visit a single hopper.) 

In this study, we assume that work enters the hoppers as a result 
of some outside process, which is independent of how the hoppers 
are being served. In No. 1 ESS, as in many other situations, much 
work does enter hoppers in this manner. However, it is also true that 
servicing a request from one hopper may place work, directly or in- 
directly, in another hopper. This interdependence may well be important 
in choice of a cycle. Nevertheless, the present model, which ignores 
such interdependence, is probably usable if we are suitably cautious 
about interpreting our results. 

Service requests are discrete items and enter the hoppers according 
to an exceedingly complicated random process. Our model, however, 
assumes that each kind of request comes in at a constant rate, with 
no statistical fluctuation whatsoever. Furthermore, we treat the number 
of requests as a continuous quantity (so that requests keep trickling 
in like water) rather than a discrete quantity. 

Despite the drastic nature of all these simplifications, we believe 
that this analysis is better than no analysis at all. Furthermore, we 
feel that our conclusions are probably valid approximations. It also 
seems plausible that our model could provide the jumping-off place 
for a more realistic study. Both interdependence and statistical fluc- 
tuation could be introduced in a limited way. (Since this was first 
written, R. W. Landgraff has done a study which extends this model 
to include interdependence.’*) This might well permit their main effects 
to enter the model, without opening the Pandora’s box of an ex- 
tremely general stochastic process with one server and many inter- 
dependent queues. 


II. SOME ASSUMPTIONS AND NOTATION 


We suppose that there are $I$ hoppers. For each hopper $i$ we assume
that we have three parameters:

$s_i$ = service time = average time to service one request in this hopper,

$r_i$ = request time = average time between occurrence of requests ($r_i > s_i$), and

$w_i$ = weight = average penalty per second of delay for a single request of type $i$.

We also use

$$\lambda_i = s_i / r_i < 1 , \qquad \Lambda = \sum_{i=1}^{I} \lambda_i .$$

(To permit a steady-state solution, we assume $\Lambda < 1$.) Note that the
definition of $w_i$ implies that on the average the penalty for delaying
one kind of task is proportional to the delay time. The $w_i$ are the
proportionality constants. This simple assumption could be refined somewhat
without too much trouble if desired. 

In No. 1 ESS, one major penalty caused by hopper delays is the 
extra waiting time they cause to the telephone user at various stages 
of his call. For some hoppers, such as those involved during the process 
of dialing, undue delays can cause mishandling of the call. (Also, the 
delays tie up memory capacity and indirectly cause a need for extra 
memory equipment. However, this effect is probably minor.) By 
considering the loss incurred by the user due to various waiting periods, 
and the loss due to the probability of a mishandled call, it would be
possible to assign sensible values to the $w_i$. Although a truly realistic
appraisal of the losses would require a quite elaborate study, some
fairly reasonable simplifying assumptions which would make this
study much simpler are available. Furthermore, assignment of the
$w_i$ on a direct intuitive basis would probably be adequate for many
purposes. 

To measure the total delay penalty paid by any work-scheduling 
algorithm, we combine the various penalties into a single number P: 


$d_i$ = expected delay for a request of type $i$,
$p_i$ = expected penalty per request of type $i$ = $w_i d_i$, and
$P$ = expected total penalty per second

$$P = \sum_{i=1}^{I} \frac{1}{r_i}\, p_i = \sum_{i=1}^{I} \frac{w_i}{r_i}\, d_i .$$

(Of course, $1/r_i$ is the expected number of requests of type $i$ in one
second.) We seek to minimize $P$ by proper choice of a work-scheduling
algorithm. Only the delays $d_i$ may be influenced in this way, so we
concentrate on evaluating the $d_i$. 
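
As a small illustration of the definition of P, the sketch below adds up the per-hopper contributions. The function name and the numerical values of the weights, request times, and delays are hypothetical and chosen only to show the arithmetic.

```python
def total_penalty_per_second(w, r, d):
    """P = sum over hoppers of (1 / r_i) * w_i * d_i, with times in seconds."""
    return sum(w_i * d_i / r_i for w_i, r_i, d_i in zip(w, r, d))


if __name__ == "__main__":
    weights = [2.0, 1.0]          # hypothetical penalties per second of delay
    request_times = [0.5, 0.25]   # hypothetical seconds between requests
    delays = [0.01, 0.02]         # hypothetical expected delays in seconds
    print(total_penalty_per_second(weights, request_times, delays))
```

A hopper with frequent requests (small request time) contributes more to P for the same delay, which is why the choice of cycle matters most for the busy hoppers.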

A model which, like ours, treats requests as continuous has the 
danger of “discovering” that the hoppers are serviced infinitely fast, 
accumulating only an infinitesimal amount of work between visits. 
The following assumption, which in any case reflects an important 
reality, avoids this collapse. 

To examine the $i$th hopper, whether or not it contains any work,
requires a certain amount of time. We assume this amount of time
is $H_i$. For simplicity we shall assume all the $H_i$ are equal, and shall call
their common value $H$, although it would be easy to work with unequal
values if desired. Thus if $x$ requests are serviced during one visit to
hopper $i$, this visit requires $H_i + x s_i$ seconds. 
It will turn out later that the value chosen for H is not very important 
in the context of this model. The comparison between different work- 
scheduling algorithms is unaffected by the (nonzero) value used. 


III. WORK-SCHEDULING AND SERVICE POLICY 


We suppose that the hoppers are visited in a fixed cycle of length
$N$, namely,

$$(i_1, i_2, \ldots, i_N).$$

This means that hopper $i_1$ is visited first, then hopper $i_2$, and so on.
After $i_N$ is visited, the cycle starts over again with hopper $i_1$. One simple
cycle with $I = 4$ and $N = 6$ is (1, 4, 2, 4, 3, 4). No. 1 ESS uses $I = 5$
hoppers (classes of hoppers, actually), and a cycle of length $N = 30$:

(1, 2, 1, 3, 1, 2, 1, 4, 1, 2, 1, 3, 1, 2, 1, 5, 1, 2, 1, 3, 1, 2, 1, 4, 1, 2, 1, 3, 1, 2).


If $i$ is any given hopper, we shall let $V(i)$ indicate the set of all visits
to hopper $i$. Thus for the cycle (1, 4, 2, 4, 3, 4), we have

V(1) = [1], V(2) = [3], V(3) = [5], and V(4) = [2, 4, 6].

In the No. 1 ESS cycle,

V(1) = [1, 3, 5, ..., 29], V(2) = [2, 6, 10, 14, 18, 22, 26, 30],
V(3) = [4, 12, 20, 28], V(4) = [8, 24], V(5) = [16]. 


For any visit $n$, the last previous visit to the same hopper is called
$b(n)$ (“b” for before). Thus in the cycle (1, 4, 2, 4, 3, 4), visit 6 is to
hopper 4, and the last previous visit to the same hopper is on visit 4.
Thus $b(6) = 4$. Because “last previous” is understood in a cyclic sense,
$b(2) = 6$. We have 


b(1) = 1, b(2) = 6, b(3) = 3, 
b(4) = 2, b(5) = 5, b(6) = 4. 
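
The visit sets V(i) and the function b(n) are easy to compute mechanically from a cycle. The following sketch uses 1-based visit numbers to match the text; the function names `visit_sets` and `last_previous_visit` are assumptions made for this illustration.

```python
def visit_sets(cycle):
    """Map each hopper i to V(i), the 1-based visit numbers at which it is visited."""
    visits = {}
    for n, hopper in enumerate(cycle, start=1):
        visits.setdefault(hopper, []).append(n)
    return visits


def last_previous_visit(cycle):
    """Map each visit n to b(n), the previous visit to the same hopper (cyclically)."""
    b = {}
    for visits in visit_sets(cycle).values():
        for k, n in enumerate(visits):
            b[n] = visits[k - 1]  # k = 0 wraps around to the last visit in the cycle
    return b


SIMPLE_CYCLE = (1, 4, 2, 4, 3, 4)
ESS_CYCLE = (1, 2, 1, 3, 1, 2, 1, 4, 1, 2, 1, 3, 1, 2, 1, 5,
             1, 2, 1, 3, 1, 2, 1, 4, 1, 2, 1, 3, 1, 2)

if __name__ == "__main__":
    print(visit_sets(SIMPLE_CYCLE))
    print(last_previous_visit(SIMPLE_CYCLE))
    print({i: len(v) for i, v in visit_sets(ESS_CYCLE).items()})
```

For a hopper visited only once per cycle, b(n) = n, which corresponds to going once around the cycle.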


Whenever a hopper is visited, we suppose that all work requests 
there are serviced. However, during the period when the hopper is 
being serviced new requests can enter it. What about these requests 
which enter the hopper while it is being serviced? These can either be 
handled when they are reached during the same visit, which we call 
the “come-right-in” policy, or they can be left for the next visit to the 
hopper, which we call the “please-wait” policy. We shall treat both of 
these hopper service policies, because their solutions are very similar. 
IV. HOW TO EVALUATE P 


As there is no statistical variation left in our model, it is easy to 
analyze. Let 


$t_n$ = time spent emptying the hopper $i_n$ during visit $n$.

Let $C$ be the time spent during an entire cycle, so that $C$ consists of $N$
hopper visits. Hopper visit $n$ consists of time $H$ to examine the hopper,
and time $t_n$ to service it. Thus

$$C = \sum_{n=1}^{N} (H + t_n) = NH + \sum_{n=1}^{N} t_n .$$

Now consider the requests which are serviced during $t_n$. Let

$T_n$ = the interval during which they enter hopper $i_n$.


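Recalling that $b(n)$ is the last prior visit to hopper $i_n$, we see from
Fig. 1 that

$$T_n = \begin{cases}
\displaystyle \sum_{p=b(n)+1}^{n} (H + t_p) = [n - b(n)]H + \sum_{p=b(n)+1}^{n} t_p , & \text{``come-right-in;''} \\[2ex]
\displaystyle \sum_{p=b(n)}^{n-1} (t_p + H) = [n - b(n)]H + \sum_{p=b(n)}^{n-1} t_p , & \text{``please-wait.''}
\end{cases} \tag{1}$$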

Note that $b(n)$ and the summation indices must be understood in a
suitable “cyclic” sense, so that (for example) if $b(n) = n$, then $n - b(n)$
means once around the cycle and hence equals $N$, not 0. Now it is easy
to see that

(the number of requests served during $t_n$) = $t_n / s_{i_n}$

= (the number of requests initiated during $T_n$) = $T_n / r_{i_n}$,


Fig. 1—Time flow diagram illustrating “please-wait” and “come-right-in” policies (the diagram marks the interval $T_n$ for each policy).
so

$$t_n = \lambda_{i_n} T_n . \tag{2}$$


By using equation (2), we can eliminate either all $T_n$ or all $t_n$ from
equations (1). This will leave us with $N$ linear equations in $N$ unknowns,
which in fact turn out to be linearly independent. By solving these
equations and using equation (2), we can find the $T_n$ and the $t_n$, and
from them all else will follow, as we show below. For convenient reference,
we state the equations after eliminating the $t_n$:

$$T_n = \begin{cases}
\displaystyle [n - b(n)]H + \sum_{p=b(n)+1}^{n} \lambda_{i_p} T_p , & \text{``come-right-in;''} \\[2ex]
\displaystyle [n - b(n)]H + \sum_{p=b(n)}^{n-1} \lambda_{i_p} T_p , & \text{``please-wait.''}
\end{cases} \tag{3}$$

Recall the special cyclic interpretation of $n - b(n)$ and the summations.

It is worth digressing briefly to derive an explicit formula for $C$,
and to show how the $N$ equations (3) can be reduced to $N - I$ equations
in $N - I$ unknowns by using it. It is easy to see that if we sum $T_n$ over
all visits to some particular hopper $j$, the result must equal $C$:

$$\sum_{n \in V(j)} T_n = C \quad \text{for every } j. \tag{4}$$

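Now sum equation (3) over all $n$ in $V(j)$, and use equation (4) several
times:

$$\begin{aligned}
\sum_{n \in V(j)} T_n &= \sum_{n \in V(j)} \Big\{ [n - b(n)]H + \sum_{p=b(n)+1}^{n} \lambda_{i_p} T_p \Big\} , \\
C &= NH + \sum_{p=1}^{N} \lambda_{i_p} T_p \\
  &= NH + \sum_{i=1}^{I} \lambda_i \sum_{p \in V(i)} T_p \\
  &= NH + \Lambda C .
\end{aligned}$$

This yields

$$C = \frac{NH}{1 - \Lambda} . \tag{5}$$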
Since $C$ is now given directly in terms of known quantities, we can use
equation (4) to solve for one $T_n$ in terms of others. We can do this
separately for each $j = 1$ to $I$, and thereby reduce the number of
unknowns and equations to $N - I$. 
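
The linear equations (3) can also be solved directly by machine, which gives a check on equations (4) and (5). The sketch below is one way to do this with exact rational arithmetic. The function name `solve_intervals`, the policy strings, the dictionary `lam` of $\lambda_i$ values, and the use of Gauss-Jordan elimination are assumptions made for this illustration, and the numerical values are hypothetical.

```python
from fractions import Fraction


def solve_intervals(cycle, lam, H, policy="come-right-in"):
    """Solve equations (3) exactly; returns [T_1, ..., T_N] (list index n - 1)."""
    N = len(cycle)
    rows = []  # augmented matrix of (identity - coefficients) T = [n - b(n)] H
    for n in range(N):
        # Number of visits back to the previous visit to the same hopper (1..N).
        gap = next(k for k in range(1, N + 1) if cycle[(n - k) % N] == cycle[n])
        if policy == "come-right-in":
            window = [(n - k) % N for k in range(gap)]         # p = b(n)+1, ..., n
        else:
            window = [(n - k) % N for k in range(1, gap + 1)]  # p = b(n), ..., n-1
        row = [Fraction(0)] * (N + 1)
        row[n] += 1
        for p in window:
            row[p] -= lam[cycle[p]]
        row[N] = gap * H
        rows.append(row)
    # Gauss-Jordan elimination in exact arithmetic.
    for col in range(N):
        pivot = next(r for r in range(col, N) if rows[r][col] != 0)
        rows[col], rows[pivot] = rows[pivot], rows[col]
        rows[col] = [x / rows[col][col] for x in rows[col]]
        for r in range(N):
            if r != col and rows[r][col] != 0:
                factor = rows[r][col]
                rows[r] = [x - factor * y for x, y in zip(rows[r], rows[col])]
    return [rows[n][N] for n in range(N)]


if __name__ == "__main__":
    lam = {1: Fraction(1, 4), 2: Fraction(1, 8), 3: Fraction(1, 2)}  # hypothetical
    H = Fraction(1)
    T = solve_intervals((1, 2, 1, 3), lam, H)
    C = 4 * H / (1 - sum(lam.values()))  # equation (5) with N = 4
    print([str(t) for t in T], "C =", C)
```

With these hypothetical values the cycle time from equation (5) is 32 units of H, and the solved intervals for each hopper add up to that value, as equation (4) requires.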

Once we have the values of $T_n$ (and hence of $t_n$), we may easily
evaluate $e_n$, the average delay for requests serviced during visit $n$.
(Each delay is reckoned from occurrence of request to when its processing
starts.) By elementary reasoning, we see that

$$e_n = \begin{cases}
\tfrac{1}{2}(T_n - t_n), & \text{``come-right-in,''} \\[1ex]
\tfrac{1}{2}(T_n + t_n), & \text{``please-wait.''}
\end{cases} \tag{6}$$

Of course $T_n / r_{i_n}$ requests are serviced in visit $n$. Thus the average
delay per request of type $i$ is

$$d_i = \frac{\displaystyle \sum_{n \in V(i)} \frac{T_n}{r_i}\, e_n}{\displaystyle \sum_{n \in V(i)} \frac{T_n}{r_i}} . \tag{7}$$


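Using equations (6), (2), and (4), we get

$$d_i = \begin{cases}
\displaystyle \frac{1 - \lambda_i}{2C} \sum_{n \in V(i)} T_n^2 , & \text{``come-right-in,''} \\[2ex]
\displaystyle \frac{1 + \lambda_i}{2C} \sum_{n \in V(i)} T_n^2 , & \text{``please-wait.''}
\end{cases} \tag{8}$$

Now let

$F_n = T_n / C$ = fraction of a cycle used by $T_n$,

so that

$$\sum_{n \in V(i)} F_n = 1, \quad \text{all } i. \tag{9}$$

Then

$$d_i = \begin{cases}
\displaystyle \tfrac{1}{2}(1 - \lambda_i)\, C \sum_{n \in V(i)} F_n^2 , & \text{``come-right-in,''} \\[2ex]
\text{same, but with } 1 + \lambda_i \text{ for } 1 - \lambda_i , & \text{``please-wait.''}
\end{cases} \tag{10}$$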


Using equation (5) and the definition of $P$, we now easily find a
formula for the penalty $P$, which is the key quantity we use to evaluate
work-scheduling algorithms:

$$P = \begin{cases}
\displaystyle \frac{NH}{2(1 - \Lambda)} \sum_{i=1}^{I} \frac{w_i}{r_i} (1 - \lambda_i) \sum_{n \in V(i)} F_n^2 , & \text{``come-right-in,''} \\[2ex]
\text{same, but with } 1 + \lambda_i \text{ for } 1 - \lambda_i \ (1 - \Lambda \text{ is unaffected}), & \text{``please-wait.''}
\end{cases} \tag{11}$$

(However, note that the values of the $F_n$ may differ for the two policies.)
We note that the work-scheduling algorithm influences equation (11)
in only two ways: through $N$, and through the fractions $F_n$. From
this formula we can evaluate and compare different work-scheduling
algorithms. Also we can compare “come-right-in” with “please-wait.” 
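
Equation (11) translates directly into a short routine. The sketch below assumes the fractions $F_n$ are already known (for example, from solving equations (3)); the function name `penalty`, its argument layout, and the numerical inputs are assumptions made for this illustration.

```python
from fractions import Fraction


def penalty(cycle, F, s, r, w, H, policy="come-right-in"):
    """Evaluate equation (11), the expected total penalty per second.

    cycle   : tuple of hopper labels, one per visit
    F       : list of fractions F_n = T_n / C, in the same order as cycle
    s, r, w : dicts keyed by hopper label (service time, request time, weight)
    H       : time to examine one hopper
    """
    N = len(cycle)
    lam = {i: Fraction(s[i]) / Fraction(r[i]) for i in s}
    Lam = sum(lam.values())
    sign = -1 if policy == "come-right-in" else 1  # 1 - lambda_i or 1 + lambda_i
    total = 0
    for i in lam:
        sum_sq = sum(F[n] ** 2 for n in range(N) if cycle[n] == i)
        total += Fraction(w[i]) / Fraction(r[i]) * (1 + sign * lam[i]) * sum_sq
    return N * H / (2 * (1 - Lam)) * total


if __name__ == "__main__":
    # Hypothetical inputs for the cycle (1, 2, 3), where every F_n equals 1.
    s = {1: 1, 2: 1, 3: 2}
    r = {1: 4, 2: 8, 3: 4}
    w = {1: 1, 2: 2, 3: 4}
    for policy in ("come-right-in", "please-wait"):
        print(policy, penalty((1, 2, 3), [1, 1, 1], s, r, w, Fraction(1), policy))
```

Since the factor $NH/[2(1 - \Lambda)]$ is common to both policies, the difference between them in this sketch comes only from the factors $1 \mp \lambda_i$ and from the fractions $F_n$ supplied.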


V. SOME EXAMPLES 


If there are $I = 3$ different hoppers, the simplest possible cycle is
(1, 2, 3), for which $N = 3$. In this case we see trivially that $F_1 = F_2 =
F_3 = 1$, for either “come-right-in” or “please-wait.” Thus equation
(11) for cycle (1, 2, 3) is:

$$P = \begin{cases}
\displaystyle \frac{3H}{2(1 - \Lambda)} \sum_{i=1}^{3} \frac{w_i}{r_i} (1 - \lambda_i) , & \text{``come-right-in,''} \\[2ex]
\text{same, but with } 1 + \lambda_i \text{ for } 1 - \lambda_i \ (1 - \Lambda \text{ is unaffected}), & \text{``please-wait.''}
\end{cases}$$

Given the three input parameters $s_i$, $r_i$, and $w_i$ for each hopper,
this can be evaluated numerically. 

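Now suppose we use the cycle (1, 2, 1, 3), for which $N = 4$, with
the “come-right-in” policy. Then equations (3) for cycle (1, 2, 1, 3)
become the following four equations:

$$\begin{aligned}
T_1 &= \lambda_1 T_1 + \lambda_3 T_4 + 2H , \\
T_2 &= \lambda_1 T_1 + \lambda_2 T_2 + \lambda_1 T_3 + \lambda_3 T_4 + 4H , \\
T_3 &= \lambda_2 T_2 + \lambda_1 T_3 + 2H , \\
T_4 &= \lambda_1 T_1 + \lambda_2 T_2 + \lambda_1 T_3 + \lambda_3 T_4 + 4H .
\end{aligned}$$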
However, taking $C$ as known, and using equation (4) for cycle (1, 2, 1, 3), namely,

$$T_1 + T_3 = C, \qquad T_2 = C, \qquad T_4 = C,$$

we eliminate the unknowns $T_2$, $T_3$, and $T_4$, leaving one equation in one unknown, $T_1$:
T, = ME: + Ae ++ 2H. 
We find 


T, = 





Ta BH + NCI. 


Dividing by C, and using C = 4H/(1 — A) from equation (5), we see 





As F, = 1 — F,, we find 
Ee ral (M=3)"] 
fo fas 1+ fia ie o) 


Fe=1, Fe=1. 


Thus equation (11) for cycle (1, 2, 1, 3) with the ‘“‘come-right-in”’ 
policy is 


AH Ww, 1 Ne — AsV’ 
p= ft mag [14 (BS) | 


Wo Ws = ; 
So T> (i a Ao) i Ts @ »)} 


and also 


Through special circumstances which would not hold in general, the values for $F_n$ using this cycle are all the same for "please-wait" as for "come-right-in," so $P$ for "please-wait" is the same as the above but with $1 + \lambda_i$ substituted for $1 - \lambda_i$ in three places. Given the parameters $s_i$, $r_i$, and $w_i$ for each hopper, this can be evaluated numerically.


VI. CONCLUSIONS 


If we compare cycles of the same length and with the same number of visits to each hopper, then equation (11) yields the following conclusion: The visits to a given hopper should be spaced as evenly around the cycle as possible.

By this we mean that the values of $T_n$ (and hence of $F_n$) pertaining to this hopper should be as equal as possible. This follows because the minimum of

$$\sum_{n \in V(i)} F_n^2 \quad \text{subject to} \quad \sum_{n \in V(i)} F_n = 1$$

occurs when the $F_n$ with $n$ in $V(i)$ are all equal. Furthermore, equation (11) can be used to estimate how serious any given deviation from equality is.
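The effect of uneven spacing can be checked with a few lines of arithmetic. The sketch below is only an illustration: the function name and the sample fractions are hypothetical, and it evaluates nothing more than the sum of squares that appears in equation (11) for one hopper visited twice per cycle.

```python
def sum_of_squares(fractions):
    """Return the sum of F_n**2 for one hopper's visit fractions.

    The fractions are assumed to satisfy the constraint sum(F_n) = 1.
    """
    assert abs(sum(fractions) - 1.0) < 1e-9
    return sum(f * f for f in fractions)


# Two visits per cycle to the same hopper (illustrative values only).
even = sum_of_squares([0.5, 0.5])      # evenly spaced visits
uneven = sum_of_squares([0.75, 0.25])  # unevenly spaced visits
print(even, uneven)
```

With these sample values the uneven spacing raises this hopper's contribution to the penalty by the ratio `uneven / even`.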

Suppose a cycle has $N_i$ visits to hopper $i$, so that $N = \sum_i N_i$, and suppose that the $N_i$ visits are spaced approximately evenly around the cycle for every $i$. Then for each visit $n$ to hopper $i$,

$$F_n \approx \frac{1}{N_i}.$$

Thus

$$P \approx \frac{HN}{2(1 - \Lambda)} \sum_i \frac{w_i}{r_i}(1 - \lambda_i)\frac{1}{N_i}, \quad \text{``come-right-in.''}$$

Either using a Lagrange multiplier to handle the constraint that $\sum_i N_i = N$, or by direct argument (see the appendix), it is easy to deduce that the values of $N_i$ which minimize this satisfy

$$N_i \ \text{proportional to} \ \left[\frac{w_i}{r_i}(1 - \lambda_i)\right]^{1/2},$$

so

$$\frac{N_i}{N} \approx \frac{\left[\dfrac{w_i}{r_i}(1 - \lambda_i)\right]^{1/2}}{\displaystyle\sum_j \left[\dfrac{w_j}{r_j}(1 - \lambda_j)\right]^{1/2}}.$$

This yields our most important conclusion: The above approximate formula gives the optimum relative frequency of visits to each hopper in the cycle.


By obtaining values for $r_i$, $s_i$, and less easily for $w_i$, it is possible to compare different work-scheduling algorithms with each other and with the "ideal" schedule with perfect spacing implied above. Notice that the actual value of $H$ does not enter into this comparison. (If we had used unequal values for the $H_i$, only the ratios $H_i/H_j$ would enter into the comparison, not the actual values of the $H_i$ themselves.)

It would probably be worthwhile to analyze the actual work-scheduling algorithm used for ESS No. 1 in these terms. It would be interesting to compare this actual algorithm with the "ideal" algorithm.

Our model, with its highly simplified assumptions, cannot possibly 
provide the last word on work-scheduling evaluations, even with regard 
to delay times. However, this kind of approach is probably desirable. 
If greater realism is desired, the most important aspects are statistical 
variability and interdependence of hoppers. 


APPENDIX 


Direct Argument to Replace the Lagrange Multiplier Argument


Henry Pollak has pointed out a simple direct argument which shows that $\sum_i (a_i/N_i)$ is minimized, subject to the constraint $\sum_i N_i = N$, if $N_i$ is proportional to $(a_i)^{1/2}$. Using $a_i = w_i(1 - \lambda_i)/r_i$, this yields the formula given above for $N_i$.

First, let $q = N/[\sum_i (a_i)^{1/2}]$. Now, we multiply the quantity to be minimized by $q^2$, and express it:

$$\sum_i \frac{q^2 a_i}{N_i} = \sum_i \left( q\left(\frac{a_i}{N_i}\right)^{1/2} - N_i^{1/2} \right)^2 + 2q \sum_i (a_i)^{1/2} - \sum_i N_i.$$

The middle term is constant by definition, and the last term is constant by constraint. The first term cannot be less than 0. The first term is 0 if

$$q\left(\frac{a_i}{N_i}\right)^{1/2} = N_i^{1/2}, \quad \text{or} \quad N_i = q(a_i)^{1/2}.$$

Since these values satisfy the constraint, we obtain the desired result.
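The square-root rule can be tried numerically. The following is a minimal sketch, not part of the original analysis: the function names and the sample values of $w_i$, $r_i$, and $\lambda_i$ are hypothetical, and the resulting $N_i$ are treated as real numbers (an actual cycle would need them rounded to integers).

```python
import math


def optimal_visit_counts(w, r, lam, total_visits):
    """Return (a, counts) with counts[i] = q * sqrt(a[i]) and sum(counts) = N.

    a[i] = w[i] * (1 - lam[i]) / r[i], as in the appendix.
    """
    a = [wi * (1.0 - li) / ri for wi, ri, li in zip(w, r, lam)]
    roots = [math.sqrt(ai) for ai in a]
    q = total_visits / sum(roots)
    return a, [q * s for s in roots]


def objective(a, counts):
    """The quantity to be minimized: sum of a_i / N_i."""
    return sum(ai / ni for ai, ni in zip(a, counts))


# Illustrative (made-up) parameters for three hoppers.
w = [1.0, 1.0, 4.0]
r = [1.0, 4.0, 1.0]
lam = [0.1, 0.2, 0.3]
a, n_opt = optimal_visit_counts(w, r, lam, total_visits=12)

# Moving half a visit from one hopper to another keeps the total fixed
# but should not decrease the objective.
shifted = [n_opt[0] + 0.5, n_opt[1] - 0.5, n_opt[2]]
print(n_opt, objective(a, n_opt), objective(a, shifted))
```

Comparing the objective at `n_opt` with nearby allocations that keep the same total gives a quick check on the direct argument.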
Some Properties of a Nonlinear Model of
a System for Synchronizing Digital 
Transmission Networks* 


By IRWIN W. SANDBERG 
(Manuscript received April 22, 1969) 


J.R. Pierce has recently proposed a system for synchronizing an arbitrary 
number of geographically separated oscillators, and, under the assumption 
of zero transmission delays between stations, has shown that a certain linear 
model of the system is stable in the sense that all of the station frequencies 
approach a common final value as t → ∞.

The purpose of this paper is to report on some results concerning the 
dynamic behavior of a nonlinear version of an important special case of 
Pierce’s model. The nonlinear model takes into account transmission delays. 

It is proved under certain very general conditions that the nonlinear model 
possesses the stability property required of a synchronization system. More 
explicitly, it is proved that the model is stable for all nonnegative values of
the delays. The results show that the model possesses some additional funda- 
mental properties of engineering interest, and they provide an analytical 
basis for using a computer for further studies. In particular, a complete 
solution to the problem of determining the final frequency of the system and 
the final value of the content of an arbitrary buffer is presented, in the sense 
that it is shown that these quantities can be determined by solving a certain set 
of nonlinear equations which is proved to possess a unique solution. 


I. INTRODUCTION 


The purpose of this paper is to report on some results concerning properties of the solution $f_1(t), f_2(t), \ldots, f_n(t)$ of the set of equations

$$f_i(t) = \varphi_i\Big\{ \sum_{j \ne i} \varphi_{ij}\Big[ \int_0^t [f_j(\tau - \tau_{ij}) - f_i(\tau)]\, d\tau + b_{ij}(0) \Big] \Big\} + c_i, \qquad i = 1, 2, \ldots, n; \quad t \ge 0 \tag{1}$$

in which $n$ is an arbitrary positive integer such that $n \ge 2$, the $\varphi_i(\cdot)$ and the $\varphi_{ij}(\cdot)$ are monotone functions that map the real interval $(-\infty, \infty)$ into itself, the $\tau_{ij}$ are nonnegative constants, and the $c_i$ and the $b_{ij}(0)$ are real constants.

* This paper was presented as an invited contribution at both the Symposium on Mathematical Aspects of Electrical Networks (sponsored by the American Mathematical Society, New York City, April 1969) and the Joint Conference on Mathematical and Computer Aids to Design (sponsored by the ACM, SIAM, and IEEE; Anaheim, California; October 1969).

The set of equations (1) governs the behavior of a nonlinear model of the key part of a system for synchronizing digital transmission networks. Our main result is that synchronization is possible under very general conditions concerning the nonlinearities and the time delays $\tau_{ij}$. In addition, an analytical basis for computing the final frequency of the system is presented; this involves proving that a certain set of nonlinear equations possesses a unique solution. Other results are presented concerning, for example, buffer requirements* and certain monotonicity properties of the frequency functions $f_i(\cdot)$.


1.1 Pierce's Model 


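When $\tau_{ij} = 0$ for all $i \ne j$, when $\varphi_i(x) = x$ for all $i$ and all real $x$, and when $\varphi_{ij}(x) = a_{ij}x$ for all real $x$ and all $i \ne j$, in which $a_{ij}$ is a real constant for all $i \ne j$, we have

$$f_i(t) = \sum_{j \ne i} a_{ij}\Big\{ \int_0^t [f_j(\tau) - f_i(\tau)]\, d\tau + b_{ij}(0) \Big\} + c_i, \qquad i = 1, 2, \ldots, n; \quad t \ge 0. \tag{2}$$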


Equations (2) are the equations of a linear model of the principal part 
of a system for synchronizing digital transmission networks recently 
proposed by J. R. Pierce.* His system employs oscillators of adjustable 
frequency and buffers which accept pulses at an incoming rate and which 
produce corresponding output pulses at the local clock rate. 

In Pierce's model the content $b_{ij}$ of the buffer at station $i$ which accepts pulses from station $j$ is assumed to satisfy the equation†

$$\dot{b}_{ij}(t) = f_j(t) - f_i(t), \qquad t \ge 0 \tag{3}$$

in which $f_j(t)$ and $f_i(t)$ are the frequencies at time $t$ at stations $j$ and $i$, respectively, and the overall system of coupled oscillators is assumed to satisfy equations (2) with $a_{ij} = a_{ji} \ge 0$ for all $i \ne j$. Under the natural assumption that there is some path from each station to every other station, Pierce has shown, by directing attention to a passive RL network analog of equations (2), that the model is stable in the sense that each frequency $f_i$ approaches the same final value as $t \to \infty$.*

* An explanation of the function of the device called a buffer is given in Section 1.1.

† As usual, a dot over a mathematical symbol denotes the derivative with respect to time.


1.2 The Nonlinear Model 


Our interest in the properties of the solution of equations (1) arises as a consequence of Pierce's work as follows. First, we wish to take into account the time delay $\tau_{ij}$ associated with transmission to an arbitrary station $i$ from an arbitrary station $j \ne i$. Thus we replace $f_j(t)$ by $f_j(t - \tau_{ij})$ in (3) and (2). The content $b_{ij}(t)$ of the $ij$th buffer is then

$$\int_0^t [f_j(\tau - \tau_{ij}) - f_i(\tau)]\, d\tau + b_{ij}(0) \tag{4}$$

for all $t \ge 0$.

Our mathematical model of a buffer does not reflect the fact that the 
capacity of a real buffer is bounded; a real buffer is a device that can store 
at most some fixed finite number of pulses. Therefore it makes sense to 
study how a linear model of a synchronization system employing buf- 
fers, such as the one governed by (2), can be modified to reduce the pos- 
sibility of occurrence of buffer overload (that is, the possibility that the 
capacity of the buffers will be exceeded). It is therefore reasonable to 
replace the expression (4) for the buffer content by some monotone nonlinear function $\varphi_{ij}(\cdot)$ of (4), with the idea in mind that $\varphi_{ij}(\cdot)$ is a function with moderate slope near the origin and very large slope corresponding to values of (4) that are in the neighborhoods of buffer overload. Similarly, in order to ease the requirements on the extent to which the frequencies of the adjustable oscillators must be variable, and in order to reduce the tendency of very large excursions in the frequencies $f_i$ during a transient phase, it is reasonable to replace the sum

$$\sum_{j \ne i} \varphi_{ij}[b_{ij}(t)] \tag{5}$$

formed at the $i$th station by some monotone nonlinear function $\varphi_i(\cdot)$ of (5), in which $\varphi_i(\cdot)$ has moderate slope near the origin and very small slope far from the origin.

* In Ref. 1 Pierce actually deals with a more general linear model than we have described here, but treats in most detail the important case described above. In connection with the more general model, Pierce has exploited the network analogy further in order to obtain an expression for the final frequency, and to make assertions concerning the behavior of the system when certain elements are nonlinear. For additional material dealing with various aspects of the problem of synchronizing geographically separated oscillators, see, for example, Refs. 2-7. In particular, Ref. 4 contains a short history of the problem.

These considerations lead at once to the study of the properties of the set of equations (1). Of course the crucial question is: "Does the system governed by (1) possess the basic stability property required of a synchronization system?" Our main result concerning (1) is that, no matter what the values of the time delays $\tau_{ij}$, under some conditions which are quite trivial from the engineering viewpoint (and rather weak from the mathematical viewpoint), it does.


II. SUMMARY OF RESULTS, AND SOME APPLICATIONS 


2.1 The Main Result Concerning (1) 


In order to describe the result, we first introduce some definitions and 
assumptions. 


Definition 1: Let $M$ denote an arbitrary $n \times n$ matrix with elements $m_{ij}$. Let the graph of $M$ denote the graph containing $n$ vertices (that is, $n$ nodes), a directed edge (that is, a directed line segment or arc) from node $j$ to node $i$ for every pair $i, j$ with $i \ne j$ and $m_{ij} \ne 0$, and no other directed edges.


Definition 2: Let M denote an arbitrary n X n matrix. Then we shall 
say that the graph of M is a communicating graph if and only if there is 
some path (not necessarily a direct path) from each node to every other 
node. 
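Definitions 1 and 2 translate directly into a reachability test. The sketch below is an illustration under stated assumptions: the function name `is_communicating` and the two example matrices are hypothetical, and the matrix is taken to be a list of rows with `m[i][j]` standing for $m_{ij}$ (0-based indices).

```python
def is_communicating(m):
    """Return True if the graph of the square matrix m is communicating.

    Following Definition 1, an entry m[i][j] != 0 with i != j gives a
    directed edge from node j to node i. Following Definition 2, the graph
    is communicating if every node can reach every other node.
    """
    n = len(m)
    # successors[j] lists the nodes i with an edge j -> i.
    successors = [[i for i in range(n) if i != j and m[i][j] != 0]
                  for j in range(n)]

    def reachable_from(start):
        seen = {start}
        stack = [start]
        while stack:
            j = stack.pop()
            for i in successors[j]:
                if i not in seen:
                    seen.add(i)
                    stack.append(i)
        return seen

    return all(len(reachable_from(s)) == n for s in range(n))


ring = [[0, 0, 1],
        [1, 0, 0],
        [0, 1, 0]]      # edges 2 -> 0, 0 -> 1, 1 -> 2
one_way = [[0, 0, 0],
           [1, 0, 0],
           [0, 1, 0]]   # edges 0 -> 1, 1 -> 2 only
print(is_communicating(ring), is_communicating(one_way))
```

A depth-first search from every node is quadratic in the number of nodes at worst, which is adequate for a small number of stations.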

We assume throughout the paper that:


(i) $\tau_{ij}$ denotes an arbitrary nonnegative constant for all $i \ne j$.

(ii) For each $i$, $\varphi_i(\cdot)$ denotes a real-valued continuously differentiable function defined on $(-\infty, \infty)$ such that

$$\underline{k}_i \le \varphi_i'(x) \le \overline{k}_i \tag{6}$$

for all $x$, with $\underline{k}_i$ and $\overline{k}_i$ positive constants.

(iii) For each $i \ne j$, $\varphi_{ij}(\cdot)$ denotes a continuously differentiable real-valued function defined on $(-\infty, \infty)$ such that either $\varphi_{ij}(x) = 0$ for all $x$, or

$$\underline{k}_{ij} \le \varphi_{ij}'(x) \le \overline{k}_{ij} \tag{7}$$

for all $x$, with $\underline{k}_{ij}$ and $\overline{k}_{ij}$ positive constants.*

(iv) The matrix $M$ defined by

$$(M)_{ii} = 0 \ \text{for all } i, \qquad (M)_{ij} = \varphi_{ij}'(0) \ \text{for all } i \ne j$$

is the matrix of a communicating graph.

(v) Each $f_i(\cdot)$ is defined and differentiable on $[-\bar\tau, \infty)$ in which $\bar\tau = \max_{i \ne j}\{\tau_{ij}\}$.

* At the price of some additional complication, we could have replaced assumptions (ii) and (iii) with assumptions concerning the behavior of the $\varphi_i(\cdot)$ and the $\varphi_{ij}(\cdot)$ on finite intervals. See Section 2.2.

Assumption (iv) possesses a simple physical interpretation. It is a natural connectivity assumption of the type needed if synchronization is to be possible in the sense that all of the station frequencies approach a common final value as $t \to \infty$.

Our basic set of equations is

$$f_i(t) = \varphi_i\Big\{ \sum_{j \ne i} \varphi_{ij}\Big[ \int_0^t [f_j(\tau - \tau_{ij}) - f_i(\tau)]\, d\tau + b_{ij}(0) \Big] \Big\} + c_i \tag{8}$$

for all $i$ and all $t \ge 0$. By differentiating both sides of these equations with respect to $t$, we have

$$\dot f_i(t) = \varphi_i'[\xi_i(t)] \sum_{j \ne i} \varphi_{ij}'[\xi_{ij}(t)]\,[f_j(t - \tau_{ij}) - f_i(t)], \qquad t \ge 0 \tag{9}$$

for all $i$, in which of course

$$\xi_i(t) = \sum_{j \ne i} \varphi_{ij}\Big[ \int_0^t [f_j(\tau - \tau_{ij}) - f_i(\tau)]\, d\tau + b_{ij}(0) \Big]$$

and

$$\xi_{ij}(t) = \int_0^t [f_j(\tau - \tau_{ij}) - f_i(\tau)]\, d\tau + b_{ij}(0).$$

Let $h_{ij}(t) = \varphi_i'[\xi_i(t)]\,\varphi_{ij}'[\xi_{ij}(t)]$ for all $t \ge 0$ and all $j \ne i$. Then

$$\dot f_i(t) = \sum_{j \ne i} h_{ij}(t)\,[f_j(t - \tau_{ij}) - f_i(t)], \qquad t \ge 0 \tag{10}$$

for all $i$. According to Theorem 1 (Section III) the coefficients $h_{ij}(\cdot)$ of (10) are such that there exists a real constant $p$ with the property that for all $i$, $f_i(t) - p \to 0$ as $t \to \infty$. This means that the system is stable in the sense that all of the station frequencies approach a common final value. Note that this result does not involve assumptions concerning the values of the nonnegative delays $\tau_{ij}$, that it is valid for monotone nonlinearities of a very general type, and that it does not involve symmetry assumptions such as $\varphi_{ij}(\cdot) = \varphi_{ji}(\cdot)$ for all $i \ne j$.


2.2. A Monotonicity Property of the $f_i(\cdot)$


The first of the two lemmas used in the proof of Theorem 1 asserts that the solution $f_1(\cdot), f_2(\cdot), \ldots, f_n(\cdot)$ of (10) possesses an interesting monotonicity property. Let $T$ be an arbitrary nonnegative value of time $t$, and let the upper envelope and lower envelope $\bar f(t)$ and $\underline f(t)$, respectively, of the $f_i(t)$ be defined for each $t \ge -\bar\tau$ by $\bar f(t) = \max_i f_i(t)$, $\underline f(t) = \min_i f_i(t)$. Let $\bar f_I(T)$ and $\underline f_I(T)$, respectively, denote the largest and smallest value of $\bar f(t)$ and $\underline f(t)$ for $t$ belonging to the interval $[-\bar\tau + T, T]$. Then, according to the lemma just referred to, $\bar f(t) \le \bar f_I(T)$ and $\underline f(t) \ge \underline f_I(T)$ for all $t \ge T$. In particular, since the $f_i(t)$ approach a common final value, we see that the interval envelope functions $\bar f_I(T)$ and $\underline f_I(T)$ approach each other as $T \to \infty$.
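The envelope property can be observed on a small numerical example. The following is a rough sketch rather than the paper's method: it applies a forward-Euler step to an equation of the form (10) with constant coefficients, and the function name, the coefficient and delay values, the step size, and the initial history are all hypothetical choices made for illustration.

```python
def simulate_delayed(a, tau, history, dt, steps):
    """Forward-Euler integration of
        dx_i/dt = sum_{j != i} a[i][j] * (x_j(t - tau[i][j]) - x_i(t)).

    history holds samples of x on [-tau_max, 0] spaced dt apart (oldest
    first); it must contain at least max lag + 1 rows.
    """
    n = len(a)
    lag = [[int(round(tau[i][j] / dt)) for j in range(n)] for i in range(n)]
    traj = [list(row) for row in history]
    for _ in range(steps):
        now = traj[-1]
        nxt = []
        for i in range(n):
            rate = sum(a[i][j] * (traj[-1 - lag[i][j]][j] - now[i])
                       for j in range(n) if j != i)
            nxt.append(now[i] + dt * rate)
        traj.append(nxt)
    return traj


# Constant coefficients on a directed ring (a communicating graph).
a = [[0.0, 1.0, 0.0], [0.0, 0.0, 0.5], [2.0, 0.0, 0.0]]
tau = [[0.0, 0.3, 0.0], [0.0, 0.0, 0.5], [0.2, 0.0, 0.0]]
dt = 0.01
history = [[1.0 + 0.01 * k, -2.0, 0.5] for k in range(51)]  # t in [-0.5, 0]

traj = simulate_delayed(a, tau, history, dt, steps=10000)
upper = max(max(row) for row in history)   # envelope over the initial interval
lower = min(min(row) for row in history)
spread = max(traj[-1]) - min(traj[-1])     # distance between envelopes at the end
print(upper, lower, spread)
```

With a step size small enough that `dt` times each row sum of `a` is below 1, every Euler update is a weighted average of earlier samples, so the discrete trajectory stays inside the initial envelope for the same reason as in the lemma.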

Our assumptions (ii) and (iii) on the $\varphi_i(\cdot)$ and the $\varphi_{ij}(\cdot)$ concern the behavior of those functions for all, and in particular arbitrarily large, arguments. The upper and lower bounds just described show that it would have sufficed to have made similar assumptions on the behavior of the $\varphi_i(\cdot)$ on any finite interval $[-a, a]$ such that for all $i$

$$\varphi_i(x) \notin \big[\underline f_I(0) - \max_i c_i,\ \bar f_I(0) - \min_i c_i\big]$$

for all $x \notin [-a, a]$. On the basis of bounds of the type described in Section 2.4, similar statements can be made concerning the pertinent range of arguments of the $\varphi_{ij}(\cdot)$.


2.3. Final-Frequency Determination 


We now turn our attention to the matter of determining the final frequency of the model governed by (1).

Let

$$p_i(t) = \int_0^t f_i(\tau)\, d\tau \tag{11}$$

for all $t \ge 0$ and all $i$. Then, since for all $t \ge 0$

$$\int_0^t f_j(\tau - \tau_{ij})\, d\tau = \int_0^{t - \tau_{ij}} f_j(\tau)\, d\tau + \int_{-\tau_{ij}}^0 f_j(\tau)\, d\tau,$$

we have, using (1),

$$\dot p_i(t) = \varphi_i\Big\{ \sum_{j \ne i} \varphi_{ij}[p_j(t - \tau_{ij}) - p_i(t) + \lambda_{ij}] \Big\} + c_i \tag{12}$$

for all $i$ and all $t \ge 0$, in which

$$\lambda_{ij} = b_{ij}(0) + \int_{-\tau_{ij}}^0 f_j(\tau)\, d\tau. \tag{13}$$

According to Theorem 2 (Section III), there exists a unique real constant $p$ and some real n-vector $q$ such that

$$p = \varphi_i\Big\{ \sum_{j \ne i} \varphi_{ij}[-p\tau_{ij} + q_j - q_i + \lambda_{ij}] \Big\} + c_i \quad \text{for all } i. \tag{14}$$
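With $p$ and $q$ such that (14) is satisfied, let

$$p_i(t) = pt + q_i + r_i(t), \qquad t \ge -\bar\tau \tag{15}$$

for all $i$, in which the $q_i$ are the components of $q$, and the $r_i(t)$ are some functions of $t$. Then, using (12),

$$p + \dot r_i(t) = \varphi_i\Big\{ \sum_{j \ne i} \varphi_{ij}[r_j(t - \tau_{ij}) - r_i(t) - p\tau_{ij} + q_j - q_i + \lambda_{ij}] \Big\} + c_i \tag{16}$$

for all $i$ and all $t \ge 0$. But, using (14) and (16),

$$\dot r_i(t) = \varphi_i\Big\{ \sum_{j \ne i} \varphi_{ij}[r_j(t - \tau_{ij}) - r_i(t) + s_{ij}] \Big\} - \varphi_i\Big\{ \sum_{j \ne i} \varphi_{ij}[s_{ij}] \Big\} \tag{17}$$

for all $i$ and all $t \ge 0$, in which $s_{ij} = -p\tau_{ij} + q_j - q_i + \lambda_{ij}$.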
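For each $i$ and each $t \in [0, \infty)$, we have, by the mean-value theorem,

$$\varphi_i\Big\{ \sum_{j \ne i} \varphi_{ij}[r_j(t - \tau_{ij}) - r_i(t) + s_{ij}] \Big\} - \varphi_i\Big\{ \sum_{j \ne i} \varphi_{ij}[s_{ij}] \Big\} = \varphi_i'[u_i(t)] \Big\{ \sum_{j \ne i} \varphi_{ij}[r_j(t - \tau_{ij}) - r_i(t) + s_{ij}] - \sum_{j \ne i} \varphi_{ij}[s_{ij}] \Big\}$$

for some $u_i(t)$ such that $u_i(t)$ lies within the closed interval with endpoints $\sum_{j \ne i} \varphi_{ij}[s_{ij}]$ and $\sum_{j \ne i} \varphi_{ij}[r_j(t - \tau_{ij}) - r_i(t) + s_{ij}]$. Similarly for each $j \ne i$ and each $t \in [0, \infty)$,

$$\varphi_{ij}[r_j(t - \tau_{ij}) - r_i(t) + s_{ij}] - \varphi_{ij}[s_{ij}] = \varphi_{ij}'[w_{ij}(t)]\,[r_j(t - \tau_{ij}) - r_i(t)]$$

for a suitably chosen $w_{ij}(t)$. Therefore (17) can be written as

$$\dot r_i(t) = \sum_{j \ne i} c_{ij}(t)\,[r_j(t - \tau_{ij}) - r_i(t)] \tag{18}$$

for all $i$ and all $t \ge 0$, where $c_{ij}(t) = \varphi_i'[u_i(t)]\,\varphi_{ij}'[w_{ij}(t)]$. But, by Theorem 1, the coefficients $c_{ij}(\cdot)$ of (18) are such that there exists a constant $\sigma$ with the property that for all $i$, $r_i(t) \to \sigma$ as $t \to \infty$. It follows [see (18)] that for all $i$, $\dot r_i(t) \to 0$ as $t \to \infty$. Since

$$\int_0^t f_i(\tau)\, d\tau = pt + q_i + r_i(t), \qquad t \ge 0$$

for all $i$, it is clear that $p$ is the final value of the $f_i(\cdot)$.

According to Theorem 2: there exists exactly one real n-vector $q$ such that, with $U^{\mathrm{tr}} = (1, 1, \ldots, 1)$,

$$U^{\mathrm{tr}} q = \varphi_i\Big\{ \sum_{j \ne i} \varphi_{ij}[-\tau_{ij} U^{\mathrm{tr}} q + q_j - q_i + \lambda_{ij}] \Big\} + c_i$$

for all $i$, and $p = U^{\mathrm{tr}} q$.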

There are some simple special cases in which we can exhibit an explicit expression for $p$. Suppose, for example, that $\tau_{ij} = 0$ for all $i \ne j$, that $b_{ij}(0) = -b_{ji}(0)$ for all $i \ne j$, and that $\varphi_{ij}(x) = -\varphi_{ji}(-x)$ for all $i \ne j$ and all real $x$. Then, using (14), we have for all $i$

$$\varphi_i^{-1}(p - c_i) = \sum_{j \ne i} \varphi_{ij}[q_j - q_i + b_{ij}(0)]$$

in which $\varphi_i^{-1}(\cdot)$ is the inverse of $\varphi_i(\cdot)$, and

$$\sum_i \varphi_i^{-1}(p - c_i) = \sum_i \sum_{j \ne i} \varphi_{ij}[q_j - q_i + b_{ij}(0)] = 0.$$

Therefore, $np = \sum_i c_i$ if $\varphi_i(x) = x$ for all real $x$ and all $i$, or if $n = 2$ and $\varphi_1(x) = \varphi_2(x) = -\varphi_2(-x)$ for all real $x$.
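The first special case can be checked by direct simulation of the linear, zero-delay model. The sketch below is illustrative only: it assumes $\varphi_i(x) = x$ and $\varphi_{ij}(x) = a_{ij}x$ with $a_{ij} = a_{ji}$ (so that $\varphi_{ij}(x) = -\varphi_{ji}(-x)$), uses a plain forward-Euler step on the differentiated equations, and the function name and all numerical values are hypothetical.

```python
def final_frequency_linear(a, b0, c, dt=0.001, steps=40000):
    """Integrate df_i/dt = sum_{j != i} a[i][j] * (f_j - f_i) from the
    initial frequencies f_i(0) = c_i + sum_{j != i} a[i][j] * b0[i][j],
    i.e. equations (2) with zero delays, and return the final frequencies.
    """
    n = len(c)
    f = [c[i] + sum(a[i][j] * b0[i][j] for j in range(n) if j != i)
         for i in range(n)]
    for _ in range(steps):
        df = [sum(a[i][j] * (f[j] - f[i]) for j in range(n) if j != i)
              for i in range(n)]
        f = [f[i] + dt * df[i] for i in range(n)]
    return f


a = [[0.0, 1.0, 0.5],
     [1.0, 0.0, 2.0],
     [0.5, 2.0, 0.0]]          # symmetric coupling, a_ij = a_ji
b0 = [[0.0, 0.3, -0.1],
      [-0.3, 0.0, 0.2],
      [0.1, -0.2, 0.0]]        # b_ij(0) = -b_ji(0)
c = [10.0, 10.4, 9.9]

f_final = final_frequency_linear(a, b0, c)
predicted = sum(c) / len(c)    # from n p = sum of the c_i
print(f_final, predicted)
```

Under these symmetry assumptions the sum of the frequencies is constant in time, which is why the common final value is the average of the $c_i$.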

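Finally, as a relevant application of the material of Section 2.2, we have when $\tau_{ij} = 0$ for all $i \ne j$

$$\min_i \Big( c_i + \varphi_i\Big\{ \sum_{j \ne i} \varphi_{ij}[b_{ij}(0)] \Big\} \Big) \le p \le \max_i \Big( c_i + \varphi_i\Big\{ \sum_{j \ne i} \varphi_{ij}[b_{ij}(0)] \Big\} \Big)$$

since $\bar f(t) \le \max_i f_i(0)$ and $\underline f(t) \ge \min_i f_i(0)$ for all $t \ge 0$, and, by (1),

$$f_i(0) = c_i + \varphi_i\Big\{ \sum_{j \ne i} \varphi_{ij}[b_{ij}(0)] \Big\}$$

for all $i$.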


2.4. Bounds on Buffer Content 


In order to analytically formulate specifications to be met by real 
buffers such that buffer overload does not occur in a real synchronization 
system of the type under study, it is natural to consider the problem of 
obtaining useful upper bounds on the contents of the mathematical buf- 
fers of our model. We do not treat this entire problem in detail in this 
paper. However, we show here that under some strong assumptions, it is 
possible to exploit the material of Sections 2.2 and 2.3 to obtain a simple 
uniform bound on buffer content. In addition, in terms of the constant 
p and the vector q introduced in Section 2.3, we present a complete 
solution to the problem of evaluating the final value of the content of an 
arbitrary buffer. 

According to Theorem 2, the vector $q$ that satisfies (14) is unique to within an additive n-vector of the form $\alpha U$, in which $\alpha$ is a real constant and $U$ is the transpose of $(1, 1, \ldots, 1)$. In particular, the quantity $\Delta_q = (\max_i q_i - \min_i q_i)$ associated with any solution pair $p$, $q$ of (14) is unique. In this section it is shown that when

$$\tau_{ij} = b_{ij}(0) = 0 \quad \text{for all } i \ne j, \tag{19}$$

then the magnitude of the content

$$\int_0^t [f_j(\tau) - f_i(\tau)]\, d\tau \tag{20}$$

of an arbitrary buffer is bounded for all $t \ge 0$ by $2\Delta_q$.

Let (19) be satisfied. As in Section 2.3, let

$$p_i(t) = \int_0^t f_i(\tau)\, d\tau, \qquad t \ge 0$$

for all $i$. Then with $p_i(t) = pt + q_i + r_i(t)$, $t \ge 0$ for all $i$, in which $p$ and $q$ satisfy (14), we find as in Section 2.3 that for suitably chosen functions $u_i(\cdot)$ and $w_{ij}(\cdot)$,

$$\dot r_i(t) = \sum_{j \ne i} c_{ij}(t)\,[r_j(t) - r_i(t)], \qquad t \ge 0 \tag{21}$$

for all $i$, in which $c_{ij}(t) = \varphi_i'[u_i(t)]\,\varphi_{ij}'[w_{ij}(t)]$. Since (21) is an equation of the same type as (10) (more precisely, see Lemma 1 of the proof of Theorem 1), it follows that for all $t \ge 0$, $r_i(t) \le \max_j r_j(0)$ and $r_i(t) \ge \min_j r_j(0)$. But $r_i(0) = -q_i$ for all $i$. Thus, for any $i$ and $j$ with $j \ne i$

$$p_j(t) - p_i(t) = q_j - q_i + r_j(t) - r_i(t) \le 2\Delta_q, \qquad t \ge 0$$

and, similarly, $p_j(t) - p_i(t) \ge -2\Delta_q$, $t \ge 0$.

Concerning the problem of evaluating $\Delta_q$, there are some cases in which it is possible to obtain simple and useful upper bounds. In one simple case we can obtain an explicit expression for $\Delta_q$. For example, suppose that (19) is satisfied and that $n = 2$. Suppose also that $\varphi_1(x) = \varphi_2(x) = -\varphi_2(-x)$ for all $x$, and that $\varphi_{12}(x) = \varphi_{21}(x) = -\varphi_{21}(-x)$ for all $x$. Then $p = \varphi_1[\varphi_{12}(q_2 - q_1)] + c_1$, $p = \varphi_2[\varphi_{21}(q_1 - q_2)] + c_2$, and, using the fact that $\varphi_2(\cdot)$ and $\varphi_{21}(\cdot)$ are odd, $2\varphi_1[\varphi_{12}(q_2 - q_1)] = c_2 - c_1$. Therefore, in this case $\Delta_q = |q_2 - q_1| = |\varphi_{12}^{-1}\{\varphi_1^{-1}[\tfrac{1}{2}(c_2 - c_1)]\}|$.

We now consider the matter of (proving the existence of and) evaluating the final value $\lim_{t \to \infty} b_{ij}(t)$ of the content of an arbitrary buffer. With $p$, $q$, the $r_i(\cdot)$, and the $p_i(\cdot)$ as defined in Section 2.3, we have for $t \ge 0$ and any $i \ne j$

$$b_{ij}(t) = \int_0^t [f_j(\tau - \tau_{ij}) - f_i(\tau)]\, d\tau + b_{ij}(0)$$

$$= p_j(t - \tau_{ij}) - p_i(t) + b_{ij}(0) + \int_{-\tau_{ij}}^0 f_j(\tau)\, d\tau$$

$$= -p\tau_{ij} + q_j - q_i + r_j(t - \tau_{ij}) - r_i(t) + b_{ij}(0) + \int_{-\tau_{ij}}^0 f_j(\tau)\, d\tau.$$

Since $r_j(t - \tau_{ij}) - r_i(t) \to 0$ as $t \to \infty$, we have the result

$$\lim_{t \to \infty} b_{ij}(t) = -p\tau_{ij} + q_j - q_i + b_{ij}(0) + \int_{-\tau_{ij}}^0 f_j(\tau)\, d\tau. \tag{22}$$

Finally, if (19) is satisfied, then, using (22),

$$\max_{i \ne j} \Big| \lim_{t \to \infty} b_{ij}(t) \Big| = \max_{i \ne j} |q_j - q_i| = \Delta_q,$$

which shows that our uniform bound $2\Delta_q$ is not unreasonable.
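The explicit two-station expression for $\Delta_q$ can be compared with a direct simulation. This is a minimal sketch under stated assumptions: the particular odd nonlinearities `phi` and `phi12`, the bisection helper `invert`, the constants `c1` and `c2`, and the forward-Euler step are all hypothetical choices, with zero delays and $b_{12}(0) = b_{21}(0) = 0$ as in (19).

```python
import math


def phi(x):
    """phi_1 = phi_2: odd, with slope between 2 and 3 (illustrative)."""
    return 2.0 * x + math.tanh(x)


def phi12(x):
    """phi_12 = phi_21: odd, with slope between 1 and 1.5 (illustrative)."""
    return x + 0.5 * math.tanh(x)


def invert(fn, y, lo=-100.0, hi=100.0):
    """Invert an increasing function by bisection on [lo, hi]."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if fn(mid) < y:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def simulate_buffer(c1, c2, dt=0.01, steps=5000):
    """Euler integration of db_12/dt = f_2 - f_1 with b_21 = -b_12."""
    b = 0.0
    f1 = f2 = 0.0
    for _ in range(steps):
        f1 = phi(phi12(b)) + c1      # station 1 frequency from (1)
        f2 = phi(phi12(-b)) + c2     # station 2 frequency from (1)
        b += dt * (f2 - f1)
    return b, f1, f2


c1, c2 = 1.0, 2.0
delta_q = abs(invert(phi12, invert(phi, 0.5 * (c2 - c1))))
b_final, f1_final, f2_final = simulate_buffer(c1, c2)
print(delta_q, b_final, f1_final, f2_final)
```

In this example the simulated buffer content settles at a magnitude equal to $\Delta_q$, and both frequencies settle at $(c_1 + c_2)/2$, in line with (22) and the special case of Section 2.3.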


2.5 Discussion 


The results presented in this paper are concerned with a reasonably 
realistic strongly-nonlinear model of an important type of synchroniza- 
tion system. They answer several key questions concerning the dynamic 
behavior of the system, and provide an analytical basis for using a com- 
puter for further studies in so far as we have proved, for example, that a 
solution pair p, q of the set of equations (14) exists, that this pair is 
unique in the sense indicated, and that it can be determined by com- 
puting the unique solution q of a related set of equations. 

On the other hand, although we have proved that under very general conditions our nonlinear model possesses the basic properties of a synchronization system, in this paper we have not considered the next
natural problem, that of determining the extent to which the system 
performance can be improved as a result of the presence of the non- 
linearities. There are several other important practical problems that 
are not considered here, such as the problem of predicting the effects of 
variable transmission delays (due to temperature changes). There is a 
clear need for much more work in this area, especially in connection with 
the problem of comparing the performance of alternative synchroniza- 
tion systems. 


III. THEOREMS 1 AND 2 
Throughout Sections III and IV:


(i) $n$ denotes an arbitrary fixed positive integer such that $n \ge 2$; the statement "for all $i$" means for all $i = 1, 2, \ldots, n$, and "for all $j \ne i$" means for all $j \in \{1, 2, \ldots, n\}$ except $j = i$.

(ii) With $v$ an arbitrary n-vector, $v^{\mathrm{tr}}$ denotes the transpose of $v$. The zero n-vector is denoted by $\theta$.

(iii) If $x$ denotes a differentiable function of $t$, then $\dot x$ indicates the derivative of $x$ with respect to $t$.

(iv) All functions and constants considered are real valued.


The following two theorems are proved in Section IV. 


Theorem 1: Suppose that the following conditions are satisfied:

(i) For each $i \ne j$, $a_{ij}(\cdot)$ denotes a nonnegative bounded measurable function defined for all $t \in [0, \infty)$.

(ii) With $\underline a$ and $\bar a$ positive constants such that $\underline a \le \bar a$, for each $i \ne j$, $a_{ij}(\cdot)$ satisfies either $a_{ij}(t) = 0$ for all $t \in [0, \infty)$ or $\underline a \le a_{ij}(t) \le \bar a$ for all $t \in [0, \infty)$.

(iii) For $t \in [0, \infty)$, the $n \times n$ matrix $A$, with $(A)_{ij} = a_{ij}(t)$ for all $i \ne j$ and $(A)_{ii} = 0$ for all $i$, is the matrix of a communicating graph.*

(iv) For each $i \ne j$, $\tau_{ij}$ denotes a nonnegative constant and $\bar\tau = \max_{i \ne j} \tau_{ij}$.

(v) For each $i$, $x_i(\cdot)$ denotes a differentiable function defined on $[-\bar\tau, \infty)$ such that

$$\dot x_i(t) = \sum_{j \ne i} a_{ij}(t)\,[x_j(t - \tau_{ij}) - x_i(t)], \qquad t \ge 0$$

for all $i$.

Then there exists a constant $p$ such that $x(t) - pU \to \theta$ as $t \to \infty$, in which $U = (1, 1, \ldots, 1)^{\mathrm{tr}}$.


Theorem 2: Suppose that assumptions (i) through (iv) in Section 2.1 are satisfied. Let $U$ denote the n-vector $(1, 1, \ldots, 1)^{\mathrm{tr}}$. Then (a) there exists a unique n-vector $q$ such that

$$U^{\mathrm{tr}} q = \varphi_i\Big\{ \sum_{j \ne i} \varphi_{ij}[-\tau_{ij} U^{\mathrm{tr}} q + q_j - q_i + \lambda_{ij}] \Big\} + c_i \quad \text{for all } i,$$

in which the $\lambda_{ij}$ and the $c_i$ are constants, and (b) concerning the solution $p$, $q$ of

$$p = \varphi_i\Big\{ \sum_{j \ne i} \varphi_{ij}[-p\tau_{ij} + q_j - q_i + \lambda_{ij}] \Big\} + c_i \quad \text{for all } i,$$

the value of $p$ is unique, and $q$ is unique to within an additive n-vector $\alpha U$, in which $\alpha$ is an arbitrary real constant.


* See definitions 1 and 2 in Section 2.1.

IV. PROOF OF THEOREMS 1 AND 2

In this section:


(i) $1_n$ denotes the identity matrix of order $n$.
(ii) The transpose of any matrix $M$ is denoted by $M^{\mathrm{tr}}$.
(iii) If $v$ is an n-vector, then $\| v \|$ denotes $(v^{\mathrm{tr}} v)^{1/2}$.
(iv) If $F$ denotes an n-vector-valued function, then $(F)_a$ denotes the $a$th component of $F$.


4.1. Proof of Theorem 1 
We first prove the following lemma. 


Lemma 1: Suppose that (i), (iv), and (v) of Theorem 1 are satisfied. For all $t \in [-\bar\tau, \infty)$, let $\bar x(t)$ and $\underline x(t)$ denote $\max_i x_i(t)$ and $\min_i x_i(t)$, respectively. Let $T$ be a nonnegative constant. Then, for all $t \ge T$, $\bar x(t) \le \sup_{[T - \bar\tau, T]} \bar x(t)$ and $\underline x(t) \ge \inf_{[T - \bar\tau, T]} \underline x(t)$.


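Proof: (upper bound) We have for all $i$

$$\dot x_i(t) = \sum_{j \ne i} a_{ij}(t)\,[x_j(t - \tau_{ij}) - x_i(t)], \qquad t \ge 0. \tag{23}$$

Thus

$$\dot x_i(t) + x_i(t) \sum_{j \ne i} a_{ij}(t) = \sum_{j \ne i} a_{ij}(t)\, x_j(t - \tau_{ij}), \qquad t \ge 0,$$

and

$$x_i(t) = x_i(T) \exp\Big[ -\int_T^t \sum_{j \ne i} a_{ij}(\tau)\, d\tau \Big] + \int_T^t \exp\Big[ -\int_\tau^t \sum_{j \ne i} a_{ij}(s)\, ds \Big] \sum_{j \ne i} a_{ij}(\tau)\, x_j(\tau - \tau_{ij})\, d\tau, \qquad t \ge T \tag{24}$$

for all $i$.

It is convenient to introduce the function $I(\cdot, \cdot, \cdot)$ defined by

$$I(u, v, k) = \exp\Big[ -\int_u^v \sum_{j \ne k} a_{kj}(\tau)\, d\tau \Big]$$

for all real $u \le v$ and all positive integers $k \le n$. Thus, for example, (24) is equivalent to

$$x_i(t) = x_i(T)\, I(T, t, i) + \int_T^t I(\tau, t, i) \sum_{j \ne i} a_{ij}(\tau)\, x_j(\tau - \tau_{ij})\, d\tau, \qquad t \ge T. \tag{25}$$

Let $t_0$ denote an arbitrary positive constant. There exist an index $k$ and a $t_1 \in [T, T + t_0]$ such that

$$x_k(t_1) = \sup_{[T, T + t_0]} \bar x(t).$$

Clearly

$$x_k(t_1) = x_k(T)\, I(T, t_1, k) + \int_T^{t_1} I(\tau, t_1, k) \sum_{j \ne k} a_{kj}(\tau)\, x_j(\tau - \tau_{kj})\, d\tau.$$

Therefore, since the $a_{kj}$ are nonnegative,

$$x_k(t_1) \le x_k(T)\, I(T, t_1, k) + \int_T^{t_1} I(\tau, t_1, k) \sum_{j \ne k} a_{kj}(\tau)\, d\tau \cdot \max_{j \ne k} \sup_{[T - \tau_{kj},\, t_1 - \tau_{kj}]} x_j(t)$$

$$\le x_k(T)\, I(T, t_1, k) + \int_T^{t_1} I(\tau, t_1, k) \sum_{j \ne k} a_{kj}(\tau)\, d\tau \cdot \sup_{[T - \bar\tau,\, t_1]} \bar x(t).$$

But

$$\int_T^{t_1} I(\tau, t_1, k) \sum_{j \ne k} a_{kj}(\tau)\, d\tau = 1 - I(T, t_1, k).$$

Thus

$$x_k(t_1) \le x_k(T)\, I(T, t_1, k) + [1 - I(T, t_1, k)] \sup_{[T - \bar\tau,\, t_1]} \bar x(t).$$

Either

$$\sup_{[T - \bar\tau, T]} \bar x(t) \le \sup_{[T, t_1]} \bar x(t) \tag{26}$$

or

$$\sup_{[T - \bar\tau, T]} \bar x(t) > \sup_{[T, t_1]} \bar x(t). \tag{27}$$

If (26) holds, then

$$x_k(t_1) \le x_k(T)\, I(T, t_1, k) + [1 - I(T, t_1, k)]\, x_k(t_1)$$

[since $x_k(t_1) = \sup_{[T, t_1]} \bar x(t)$], and hence

$$x_k(t_1) \le x_k(T),$$

which implies that $x_k(t_1) \le \sup_{[T - \bar\tau, T]} \bar x(t)$. If (27) holds, then [since $x_k(T) \le \sup_{[T - \bar\tau, T]} \bar x(t)$]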


2988 THE BELL SYSTEM TECHNICAL JOURNAL, NOVEMBER 1969 


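$$x_k(t_1) \le I(T, t_1, k) \sup_{[T - \bar\tau, T]} \bar x(t) + [1 - I(T, t_1, k)] \sup_{[T - \bar\tau, T]} \bar x(t) \le \sup_{[T - \bar\tau, T]} \bar x(t).$$

We have shown that

$$\sup_{[T, T + t_0]} \bar x(t) \le \sup_{[T - \bar\tau, T]} \bar x(t). \tag{28}$$

But $t_0$ is an arbitrary positive number. Therefore

$$\sup_{t \ge T} \bar x(t) \le \sup_{[T - \bar\tau, T]} \bar x(t).$$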

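(lower bound) Our proof of the inequality

$$\inf_{t \ge T} \underline x(t) \ge \inf_{[T - \bar\tau, T]} \underline x(t) \tag{29}$$

parallels the derivation of the upper bound, and is outlined below. There exists an index $l$ and a $t_2 \in [T, T + t_0]$ such that

$$x_l(t_2) = \inf_{[T, T + t_0]} \underline x(t).$$

Thus

$$x_l(t_2) \ge x_l(T)\, I(T, t_2, l) + [1 - I(T, t_2, l)] \inf_{[T - \bar\tau,\, t_2]} \underline x(t). \tag{30}$$

Either

$$\inf_{[T - \bar\tau, T]} \underline x(t) \ge \inf_{[T, t_2]} \underline x(t)$$

or

$$\inf_{[T - \bar\tau, T]} \underline x(t) < \inf_{[T, t_2]} \underline x(t).$$

In either case, we find using (30), that (29) is satisfied. $\square$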
We note that it is a consequence of Lemma 1 that the components of $x(\cdot)$ are bounded on $[0, \infty)$, and, since $x(\cdot)$ and $\dot x(\cdot)$ are related by (23), that the components of $\dot x(\cdot)$ are bounded on $[0, \infty)$.

Assume that

$$\sup_{[u - \bar\tau, u]} \bar x(t) - \inf_{[u - \bar\tau, u]} \underline x(t)$$

[$\bar x(\cdot)$ and $\underline x(\cdot)$ are defined in the statement of Lemma 1] does not approach zero as $u \to \infty$. We shall show that this assumption implies that the components of $x(\cdot)$ are not bounded on $[0, \infty)$, a contradiction.

Since, by assumption, $\sup_{[u - \bar\tau, u]} \bar x(t) - \inf_{[u - \bar\tau, u]} \underline x(t)$ does not approach zero as $u \to \infty$, there exist a positive constant $\epsilon$ and a set $\{u_q\}_1^\infty$ with $u_q \in [0, \infty)$ and $\sup_q u_q = \infty$ such that

$$\sup_{[u_q - \bar\tau, u_q]} \bar x(t) - \inf_{[u_q - \bar\tau, u_q]} \underline x(t) \ge 2\epsilon$$

for all $q$. For each $q$ let $t_q \in [u_q - \bar\tau, u_q]$ and $t_q' \in [u_q - \bar\tau, u_q]$ be such that

$$\underline x(t_q) = \inf_{[u_q - \bar\tau, u_q]} \underline x(t), \qquad \bar x(t_q') = \sup_{[u_q - \bar\tau, u_q]} \bar x(t).$$

Of course $\sup_q t_q = \infty$ and $|t_q - t_q'| \le \bar\tau$. Thus there exists a set $\{\lambda_q\}_1^\infty$ of real constants such that $|\lambda_q| \le \bar\tau$ for all $q$, with the property that $\bar x(t_q + \lambda_q) - \underline x(t_q) \ge 2\epsilon$ for all $q$. It follows from the definition of $\bar x(\cdot)$ and $\underline x(\cdot)$ that for each $q$ there exist indices $l(q)$ and $s(q)$ such that $x_{l(q)}(t_q + \lambda_q) - x_{s(q)}(t_q) \ge 2\epsilon$.

Finally, since the components of $\dot x(\cdot)$ are bounded on $[0, \infty)$, there exists a positive constant $\delta$ such that for all $q$, $x_{l(q)}(t + \lambda_q) - x_{s(q)}(t) \ge \epsilon$ for all $t \in [t_q - \tfrac{1}{2}\delta,\ t_q + \tfrac{1}{2}\delta]$.


At this point we need the following lemma. 


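Lemma 2: If the hypotheses of Theorem 1 are satisfied, if $T$ is a nonnegative constant, and if there exist three positive constants $t_q$, $\epsilon$, and $\delta$ and indices $l(q)$ and $s(q)$ such that $t_q - \tfrac{1}{2}\delta > T + \bar\tau$ and $x_{l(q)}(t + \lambda_q) - x_{s(q)}(t) \ge \epsilon$ for all $t \in [t_q - \tfrac{1}{2}\delta,\ t_q + \tfrac{1}{2}\delta]$, with $\lambda_q$ a constant and $|\lambda_q| \le \bar\tau$, then there exist positive constants $\xi$ and $\Delta$ such that, with $\bar x(t)$ as defined in the statement of Lemma 1,

$$\sup_{t \ge \xi} \bar x(t) \le \sup_{[T - \bar\tau, T]} \bar x(t) - \Delta$$

and $\Delta$ depends only on $\underline a$, $\bar a$, $\bar\tau$, $\epsilon$, and $\delta$.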
Proof:
As in the proof of Lemma 1, it is convenient to introduce the function $I(\cdot, \cdot, \cdot)$ defined by

$$I(u, v, k) = \exp\Big[ -\int_u^v \sum_{j \ne k} a_{kj}(\tau)\, d\tau \Big]$$

for all real $u \le v$ and all positive integers $k \le n$. The relation between $T$, $\bar\tau$, $t_q$, and $\delta$ is indicated in Fig. 1.

From (23)

$$x_i(t) = x_i(T)\, I(T, t, i) + \int_T^t I(\tau, t, i) \sum_{j \ne i} a_{ij}(\tau)\, x_j(\tau - \tau_{ij})\, d\tau$$

for all $t \ge T$ and all $i$. By Lemma 1, $\bar x(t) \le \sup_{[T - \bar\tau, T]} \bar x(t)$ for all $t \ge T$.

Fig. 1. Relation between $T$, $\bar\tau$, $t_q$, $\delta$.

Therefore

$$\sup_{[T - \bar\tau, T]} \bar x(t) - x_{s(q)}(t) \ge \epsilon \tag{31}$$

for all $t \in [t_q - \tfrac{1}{2}\delta,\ t_q + \tfrac{1}{2}\delta]$.
Let k, be an index such that a,,,..)(t) # 0 for all ¢ 2 0. Then for 
Peto oo sr 


was(d) = ae(DUCL, tbs) +f LGr, ty Ia) SS duals e(e — hi) ar 


<1(T,t,k,) sup a(t) 
(rT-T,T7]) 


t 
+ f Ir, Do angler — 40,5) dr 
T ixky 
iA#s(q@) 
tq—hb+7k, 8 (a) 
=P [ I(r, t, ky)desscay(t)Xscqy(T — Tkrscay) AT 
tat+}o+7e, 8 (a) 


oF I(r, t, kes) Qe, sca) (7) Xs¢a) (7 oe Tkys(a)) dr 


ta-$5+7k, 8 (9) 


oF I(r, t, Ky) dy, sca (7) Xs ¢qy (7 = Feta) OF: 


tathb+7ky 8 (a) 
By Lemma 1, for each j, 
a(t — 3) S sup a0) (32) 
(T-7,T] 


for allt7 2 T + 7,,;. But (82) is obviously satisfied also for 7 e [T, T + 
Tx,;]. Chat is, (82) holds for all 7 2 T and all 7. Thus, using (81), 


,,() S 1(T, t,k,) sup (0) 
(T-7. 7) 


+ f 1, tk) Danke) dr sup a(0) 
T Tvky (T-7,T) 


tgthb+7k, 2 (a) 
a) | I(r, t, Key) Ax, e¢a) (7) dr 


tq—-h54+7k, 8 (a) 


IIA 


tgthitreys(a) 
sup <£(t) — e| I(r, ty ky)dx,s¢q AT 
t 


(T-7.T) q7gOtTk, 8(q) 
for all $t \ge t_q + \tfrac{1}{2}\delta + \bar{\tau}$, since

$$\int_T^t I(\tau, t, k_1) \sum_{j \ne k_1} a_{k_1 j}(\tau)\, d\tau = 1 - I(T, t, k_1).$$
But, for all k, , 


tqthd+rk, 8 (q) tqthot+rhy sq) Saf , 
—(n-1)a(t— 
| Is; t, kis, 0ca(7) dt 2 i em Valt=1) ae 
t 


a5 + 7k, 8a) anh 47k 86a) 


1Q 


$$= K \exp\{-\beta[t - t_q - \tau_{k_1 s(q)} - \tfrac{1}{2}\delta]\}$$

in which $\beta = (n - 1)\bar{a}$ and $K = \underline{a}\{1 - \exp[-(n - 1)\bar{a}\delta]\}[(n - 1)\bar{a}]^{-1}$.
Therefore 


x, (0) = sup z(t) — kK exp {—blt =. ta — Tkis(a@) 46]} 
[T-7,T] 


for allt = t, + 36 + 7. In particular, 


sup #(t) — x,,(f) = «Ke ®°*” 
[T-7,T] 


for allte [¢, + 36+7,t, + $6 +7]. Similarly, if the index k, is such that 
ay,x,(t) # 0 for all ¢ 2 O, we have for allt = t, + 36 + 27 


t,(t) = a, (T)L(T, t, ke) 
+f Ter, tha) Do auilaei(r — rai) dr 


sup a(t) — «Ke eO*® 
(7T-7,T] 


IIA 


xD =p = ho. tea $6 — 7)}. (33) 
In particular, for te [t, + #6 + 27, t, + 36 + 27] 


sup. 4) —-a4.0) 2 err. 
(r-7,7) 


Since the graph of $A$ is a communicating graph, we may continue in
this manner to obtain an upper bound of the type (33) for all of the
$x_i(\cdot)$. More explicitly, for each $k_1 \in \{1, 2, \cdots, n\}$, let $\{k_1, k_2, \cdots, k_p\}$
denote a finite set of positive integers, with the integer $p$ dependent on
$k_1$, such that $\{k_1, k_2, \cdots, k_p\} \supseteq \{1, 2, \cdots, n\}$ and

$$a_{k_2 k_1}\, a_{k_3 k_2} \cdots a_{k_p k_{p-1}} \ne 0, \qquad t \ge 0.$$
Then, with B = sup;r-7,r) €(), uw = eF°*”, and 


T, =t, + 36+74+(r—1)(6+7) forall r=1,2,---,p, 
we have
z,,(t) = B-— Kee h'-™, t=T, 
z,,(t) S$ B— Kuch e FO ™ | i= 7, 
t,,(t) SB — Kur te FeO t=T, 


Now let ¢ = T, + 7 with 7 (0, 7]. Then 
a) SBS Kee eee 


a, (t) <B- eK ue Pie B82) + o— An 


a, (t) S$ B— eK?u? te Fe*". 

Thus, for allée [J,, T, + 7], 
w(t) S B— Ai, 

for allr = 1, 2, --+ , p, in which 

Ay Sanam fea ee 
Let A = min,, A;, , and observe that A depends only on a, 4d, 7, e, and 
6. By Lemma 1, 

at) S$B-—A 


foralti=27,+7.0 
Since, as indicated earlier, there are an infinite number of $\delta$-intervals
with centers $t_q$ such that $\sup\{t_q\} = \infty$, and such that there exist indices
$l(q)$ and $s(q)$ with the property that

$$x_{l(q)}(t + \lambda_q) - x_{s(q)}(t) \ge \epsilon \qquad (34)$$

for all $t \in [t_q - \tfrac{1}{2}\delta,\ t_q + \tfrac{1}{2}\delta]$, with the constants $\lambda_q$ such that $|\lambda_q| \le \bar{\tau}$,
we see that Lemma 2 and the assumption that

$$\sup_{[u-\bar{\tau},\,u]} \bar{x}(t) - \inf_{[u-\bar{\tau},\,u]} \underline{x}(t) \qquad (35)$$

does not approach zero as $u \to \infty$ imply that $\bar{x}(t) \to -\infty$ as $t \to \infty$,
which contradicts the fact that $x(\cdot)$ is bounded on $[0, \infty)$. Therefore
(35) approaches zero as $u \to \infty$. But, by Lemma 1, $\sup_{[u-\bar{\tau},\,u]} \bar{x}(t)$ is
monotone nonincreasing in $u$ and bounded from below. Thus there is a
constant $\bar{L}$ such that

$$\sup_{[u-\bar{\tau},\,u]} \bar{x}(t) \to \bar{L}$$

as $u \to \infty$. Similarly, by Lemma 1, $\inf_{[u-\bar{\tau},\,u]} \underline{x}(t)$ is monotone
nondecreasing in $u$ and bounded from above. Thus there is a constant
$\underline{L}$ such that

$$\inf_{[u-\bar{\tau},\,u]} \underline{x}(t) \to \underline{L}$$

as $u \to \infty$. But we have proved that $\bar{L} = \underline{L}$. Therefore $\bar{x}(t)$ and $\underline{x}(t)$
both approach $\bar{L}$ as $t \to \infty$, which means that there is a constant $\rho$ such
that

$$x(t) - \rho U \to \theta \quad \text{as } t \to \infty.$$


4.2. Proof of Theorem 2 

In part (a) of this proof we employ a theorem of R. S. Palais* according
to which: if $F(\cdot)$ is a continuously-differentiable mapping of
real Euclidean n-space $E^n$ into itself with values $F(q)$ for $q \in E^n$ such that

(i) $\det J_q \ne 0$ for all $q \in E^n$, in which $J_q$ is the Jacobian matrix of
$F(\cdot)$ with respect to $q$, and

(ii) $\lim_{\|q\| \to \infty} \|F(q)\| = \infty$,

then $F(\cdot)$ is an invertible mapping of $E^n$ onto itself and $F(\cdot)^{-1}$ is continuously
differentiable on $E^n$.


We have 
Ug = 0:{ Deoul—trU' Gd + a — ae +H ]} te: for alla. 
jt : 


Let F(-) denote the mapping of E” into itself defined by the condition 
that for all 7 and all qe E”: 
[F@]: = U"'¢ — vil pe gil 7390" + as — Gs + Nal}. 


7At 


Our objective is to show that $F(\cdot)$ satisfies conditions (i) and (ii) of
Palais’ theorem.
We have, with F; denoting [F(-)]; , 


on =1+od/0+7)¢%; forall? 
i pHi : 


* See Ref. 8 and the appendix of Ref. 9. 
and
OF, ; : F : 
ONE ade ee eae oF >» Tit; — Give forall k #1 
Od: ii 

in which 

= otf » eil—7i3U''g dy de aa 
gt 
and 


ot; = ehl—TtsU" 9 + a — Gs + ONG). 
Let 8;; = ¢/y!; for allz ¥ j, let V be the n-vector defined by 
Vv" = (1 = D> Bist; 5) 1 + > BojT2; er 1 + ye Biitas 
71 72 jAn 

and let B denote the n X n matrix defined by 

(B);; = >> 8; foralli,  (B),; = —8;; forall 7 # j. 
pri 

Then J, = B+ VU". 

Suppose now that $\det J_q = 0$ for some n-vector $q$. For that $q$, there
would exist an n-vector $x \ne \theta$ such that $J_q^{tr} x = (B^{tr} + UV^{tr})x = \theta$.
Since the column space of $B^{tr}$ is orthogonal to $U$, we must have $B^{tr}x = \theta$
and $V^{tr}x = 0$. But $B$ is of rank $(n - 1)$ and the cofactors of $B$ are nonnegative.*

Thus $B^{tr}x = \theta$ implies that $x = \xi y$, in which $y$ is any column of the
matrix of cofactors of $B$ and $\xi$ is some real nonzero constant.† But we
must have $V^{tr}x = \xi V^{tr}y = 0$, which is a contradiction, since at least one
element of $y$ and all of the elements of $V$ are positive. Therefore $F(\cdot)$
meets condition (i) of Palais’ theorem.

We now show that $F(\cdot)$ satisfies condition (ii) of the theorem of Palais.

It is a simple matter to verify that for all $i$

f,= U'"q —~T; drt rsU""g 0p gal elu 0s) 
in which 


Tt 


Dd vil—tU'"g ae eo) De vil) 


ii Tt 


gil > gil—71;U'"g de ee Ais]} = vit Do vesldss]} 


1 


* See Ref. 10 for a proof that $B$ is of rank $(n - 1)$ and that the cofactors of all of
the $(B)_{ii}$ elements of $B$ are positive. Since $BU = \theta$, each of the columns of the
transposed matrix of cofactors of $B$ is proportional to the vector $U$ (see the footnote
that follows). Therefore all of the cofactors of $B$ are positive.

† This follows from the well-known proposition that $M^{tr}C = I_n \det M$, in which
$M$ is any square matrix and $C$ is the matrix of cofactors of $M$.
and


gif—7.;U'"@ + 45 — 9: + Ail = oi:[ris] 
—7,,U''@ + 4; — 4 





r= ) 
with the understanding that r; is unity when the corresponding numera- 
tor is zero, 7;; is zero for all g and all 7 ¥ j for which ¢,; is identically 
zero, and 1;; 1s unity for all 2 + 7 for which ¢;; is not identically zero and 
for all q for which the corresponding numerator is zero. Therefore, for 
all ge E”, F(q) = Mq + 8, in which then X n matrix M is obtained from 
J, by replacing ¢/ by r; and ¢/; by r,; for all z and allz ¥ j, respectively, 
and the ith component of s is —¢,[>.j;4: ¢:;(Az;)] for all z. In particular, 
$\det M \ne 0$ for all $q$. Therefore $\det(M^{tr}M) > 0$ for all $q$. Since all of the
$r_i$ as well as all of the nonidentically zero $r_{ij}$ are bounded above and
below by positive constants uniformly for $q \in E^n$, there exists a positive
constant $\epsilon$ such that $\det(M^{tr}M) \ge \epsilon$ for all $q \in E^n$.

Let $\lambda_1, \lambda_2, \cdots, \lambda_n$ denote the eigenvalues of $M^{tr}M$. Then $\lambda_1 \lambda_2 \cdots \lambda_n \ge \epsilon$
for all $q \in E^n$. Assume that $\lambda_1 \le \lambda_2 \le \cdots \le \lambda_n$. Since all of the $r_i$
and all of the $r_{ij}$ are bounded from above uniformly for $q \in E^n$, there
exists a positive constant $\bar{\lambda}$ such that $\lambda_n \le \bar{\lambda}$ for all $q \in E^n$. Thus, for all
$q \in E^n$ we have $\lambda_1 \ge \epsilon \bar{\lambda}^{-(n-1)}$. Therefore,

$$\|F(q)\| = \|Mq + s\| \ge \|Mq\| - \|s\| \ge (\epsilon \bar{\lambda}^{1-n})^{1/2}\, \|q\| - \|s\|$$

for all $q \in E^n$, from which it is clear that $\|F(q)\| \to \infty$ as $\|q\| \to \infty$.
This completes the proof of part (a) of our theorem.

Next we show that there is at most one $\rho$ with the property that there
exists a $q \in E^n$ such that for all $i$


p= ef Dy eal pris +q;—a+ A.J} te . 


Let $\rho^{(a)}$ and $\rho^{(b)}$ be two constants, and $q^{(a)}$ and $q^{(b)}$ two n-vectors, such
that for all $i$
p= of Dd) vil pO ras HQ? — GS? + AG]} +e: 
tra 


and q 


pe” = of Deul—p Pr +a? — a +d} te. 


742 


Then with q° = q‘” + aU, in which the constant a is chosen so that 


p? — p”? = U"(q® — q), we have for all 7 


oe” = of Deil—p Ora + ai? — a? + Mal} +e 


71 
where p‘“” = p°”. Thus
po” — a = pil De eil— p75; =F qs” pa qe + deal} 
Trt 
— o:{ Dd Gil Po 753 a q;° = qi + Aas]} 
iAi 


and 
pe as ee = GG? = gy: 


Therefore we can define nonnegative ratios p; and p;; similar to the 
r; and r;; above, such that 
Ug? — ¢) =p: Depil-U"'@? — a) rs 


Tt 
+ (q3? — gf) — (ai? — g:°)) for all z, 


and such that these equations are equivalent to M’(q — q°°) = 6 
in which the n X n matrix M’ is obtained from J, by replacing 9 by 
p; and ¢/, by p,; for all 7 and all7z ¥ j, respectively, so that det M’ ¥ 0. 
But this implies that q‘“” = q° and hence that p® = p\”. 

We shall now prove that $q$ is specified to within an additive vector
of the form $\alpha U$ in which $\alpha$ is a real constant.


Suppose that, with g“ and q“ two n-vectors, 
p— vil 2, eal Pres + a3? — ai? + as)} 
=p— ii Dy eel eri gy ge Nel) 
for all 7. Then, with the p; and p,; as introduced above, 


Di Dd Dili? — af”) — (i? — ai?)] = 0 


TAt 


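for all $i$. Thus, since $\rho_i \ne 0$ for all $i$, and the $n \times n$ matrix $P$ defined by

$$(P)_{ii} = \sum_{j \ne i} \rho_{ij} \quad \text{for all } i,$$

$$(P)_{ij} = -\rho_{ij} \quad \text{for all } i \ne j$$

is of rank* $(n - 1)$, and $PU = \theta$, we have $q^{(a)} - q^{(b)} = \alpha U$ for some real
constant $\alpha$. □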





* See Ref. 10. 
Overflow Oscillations in Digital Filters


By P. M. EBERT, JAMES E. MAZO, 
AND MICHAEL G. TAYLOR 


(Manuscript received May 9, 1969) 


The cascade and parallel realizations of an arbitrary digital filter are 
both formed using second order sections as building blocks. This simple 
recursive filter is commonly implemented using 2’s complement arithmetic 
for the addition operation. Overflow can then occur at the adder and the 
resulting nonlinearity causes self-oscillations in the filter. The character 
of the resulting oscillations for the second order section are here analyzed 
in some detail. A simple necessary and sufficient condition on the feedback 
tap gains to ensure stability, even with the presence of the nonlinearity, is
given although for many desired designs this will be too restrictive. A
second question studied is the effect of modifying the “arithmetic” in order
to quench the oscillations. In particular it is proven that if the 2’s complement
adder is modified so that it “saturates” when overflow occurs, then no
self-oscillations will be present. 


I. INTRODUCTION 


A digital filter using idealized operations can easily be designed to be 
stable.’ Nevertheless, in actual implementations, the output of such a 
stable filter can display large oscillations even when no input is present.* 
A known cause of this phenomenon is the fact that the digital filter 
realization of the required addition operation can cause overflow, 
thereby creating a severe nonlinearity. Our purpose here is twofold. 
The first is to give a somewhat detailed analysis of the character of the 
oscillations when the filter is a simple second order recursive section with 
two feedback taps. This unit is the fundamental building block for the 
cascade and the parallel realization of digital filters, and as such is 
worthy of some scrutiny.” A simple conclusion which one can draw from 

* To the best of our knowledge, these oscillations were first observed and diagnosed 
by L. B. Jackson of Bell Telephone Laboratories. 

† In the present work rounding errors in multiplication or storage are neglected
and therefore so are the little-understood oscillations attendant upon these
nonlinearities.


3000 THE BELL SYSTEM TECHNICAL JOURNAL, NOVEMBER 1969 


the analysis is that the design of many useful filters requires using values 
of feedback coefficients such that the threat of oscillations is always 
present (with 2’s complement arithmetic). Optimum solutions that cope 
with this state of affairs are still unknown. Some recent proposals include 
observing when overflow at the adder is to occur and then taking ap- 
propriate action. Our second purpose, then, is to discuss the effectiveness 
of some of these ideas, and to give a proof that modifying 2’s complement 
arithmetic so that the adder “‘saturates’’ is an effective way to eliminate 
the oscillations. Questions of how this nonlinearity will affect the desired 
outputs from a particular ensemble of input signals are not yet answered 
however, and perhaps for some applications other solutions need be 
considered. 


II. PROBLEM FORMULATION AND GENERAL DISCUSSION 


As explained in the introduction, this paper deals primarily with the 
simple structure shown in Fig. 1. The outputs of the registers, which 
are storage elements with one unit of delay, are multiplied by coef- 
ficients a and b respectively, fed back, and “‘added” to the input in the 
accumulator. No round-off error is considered either in multiplication 
or storage, but overflow of the accumulator is not neglected. In other 
words, the accumulator will perform as a true adder if the sum of its 
inputs is in some range; otherwise a nonlinear behavior is observed. 

Figure 2 shows the instantaneous input-output characteristic $f(v)$
of the device motivated by using 2’s complement arithmetic. It is also 
important to note that there is no memory of the accumulator for 
past outputs; that is, the device is zeroed after the generation of each 
output. 

If we let $x(t)$ be the input signal to the device, $y(t)$ the output, and
$f(\cdot)$ the nonlinear characteristic of the accumulator, we have the basic
equation

$$y(t + 2) = f[a\,y(t + 1) + b\,y(t) + x(t + 2)]. \qquad (1)$$

Fig. 1 — Basic configuration for the digital filter: $y_{k+2} = f[a\,y_{k+1} + b\,y_k + x_{k+2}]$.

Fig. 2 — Instantaneous transfer function of the accumulator.


We shall be concerned with the self-sustaining oscillations of the device 
that are observed even when no input is present [x(t) = 0], and when 
linear theory would predict the device to be stable. 

By making the linear approximation $f(v) = v$, the linearized version
of equation (1) becomes, with no driving term in the equation,

$$y(t + 2) - a\,y(t + 1) - b\,y(t) = 0. \qquad (2)$$

The roots of the characteristic equation for equation (2) are

$$\rho_{1,2} = \frac{a \pm (a^2 + 4b)^{1/2}}{2} \qquad (3)$$

and the region of linear stability corresponds to the requirement that
$|\rho_i| < 1$. This region is depicted as a subset of the a-b plane in Fig. 3.
One has $|\rho_i| < 1$ if and only if one is within the large triangle shown in
Fig. 3. For this situation any solution of (2) will damp out to zero after
a sufficient period of time. Now note that (2) is not necessarily a valid
reduction of (1) even when $x(t) = 0$. The output, by choice of $f$, has been
assumed to be constrained to be less than unity in magnitude, but this is not
sufficient to guarantee that the argument of the function $f$ is less than unity
in magnitude. For this to be the case we require

$$|a\,y(t + 1) + b\,y(t)| < 1. \qquad (4)$$

Since $|y(t)| < 1$, equation (4) will always be satisfied provided that

$$|a| + |b| < 1. \qquad (5)$$
Fig. 3 — Some interesting regions in the “space” of feedback tap weights. The hatching
indicates stability even with the nonlinearity.


The subset of the a-b plane for which (5) is true is shown in Fig. 3 
with vertical hatching, and is a subset of the region of linear stability. 
It is shown in this Section that if (5) is not satisfied there always exist 
self-sustained oscillations of the digital filter and hence (5) is both a 
necessary and sufficient condition for absence of self-sustained oscilla- 
tions.* One way to avoid the oscillations in question is simply to impose 
the requirement (5). This trick has its limitations, however, for it clearly 
restricts design capabilities. The region of the s-plane which is shaded 
in Fig. 4 shows the allowable pole positions. Roughly speaking, one con- 
cludes that there are desirable filter characteristics that can be realized 
with this restriction and there are desirable characteristics that cannot. 
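
As a rough numerical companion to conditions (3) and (5), the following Python sketch labels a tap pair $(a, b)$ according to whether it is linearly stable and whether it also satisfies $|a| + |b| < 1$. The function names and the label strings are assumptions introduced here for illustration; they do not appear in the original analysis.

```python
import cmath


def characteristic_roots(a, b):
    """Roots of rho**2 - a*rho - b = 0, as in equation (3)."""
    disc = cmath.sqrt(a * a + 4.0 * b)
    return (a + disc) / 2.0, (a - disc) / 2.0


def classify_taps(a, b):
    """Label a tap pair (a, b); the label strings are illustrative only."""
    r1, r2 = characteristic_roots(a, b)
    if max(abs(r1), abs(r2)) >= 1.0:
        return "linearly unstable"
    if abs(a) + abs(b) < 1.0:
        # Condition (5): the adder argument can never leave (-1, 1).
        return "stable, no overflow oscillation"
    return "linearly stable, overflow oscillation possible"


if __name__ == "__main__":
    for taps in [(0.5, 0.3), (-1.0, -0.5), (1.0, -0.5), (2.5, 0.0)]:
        print(taps, "->", classify_taps(*taps))
```

The second and third tap pairs in the usage loop lie inside the stability triangle but outside the region defined by (5).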

It is not our purpose here to outline those applications for which (5) 
will not be restrictive; we proceed to sketch the situation when $|a| + |b| > 1$
and the threat of oscillation is present. Sections III and IV
contain, we believe, a novel and interesting mathematical treatment of
the general problem of classifying the self-oscillations of the nonlinear
difference equation (1). However, for the user of digital filters a simple
proof that $|a| + |b| > 1$ is sufficient for the threat of oscillations is
of more immediate interest. After reading the simple proof of this fact 
given next in the present section, such a reader may wish to proceed 
directly to Section V. 

Consider the possibility of undriven nonlinear operation giving a dc
output, that is, $y_k = y$ for all $k$. Equation (1), with $x(t) = 0$, becomes
$y = f[(a + b)y]$. Assuming for definiteness that $y > 0$, we can easily see
from Fig. 2 that the above equation will be true if $(a + b)y = y - 2$,
which implies $y = 2/(1 - a - b)$. One can show (see discussion following
equation 17) that this $y$ will have magnitude $< 1$ provided only
that the tap values $a$ and $b$ lie in the region labeled I in Fig. 3. Thus a
consistent dc oscillation is always possible for all $(a, b)$ pairs in this
region. Next consider the possibility of a period 2 oscillation. This
amounts to finding a consistent solution to $y = f[(b - a)y]$. Proceeding
as before we obtain

$$y = \frac{2}{1 + a - b}.$$

Thus $y_k$ will be given by $(-1)^k y$, and will have magnitude less than unity
if the $(a, b)$ pair lies anywhere in region II of Fig. 3.

* I. W. Sandberg has informed the authors that the necessity and sufficiency of
(5) for absence of oscillations has also been obtained jointly by him and
L. B. Jackson.
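
The dc and period 2 solutions just described can be checked by direct simulation. The sketch below assumes that the characteristic of Fig. 2 is the usual 2's complement wraparound onto $[-1, 1)$. The helper names `wrap` and `run_filter` and the two tap pairs are illustrative choices, not taken from the paper.

```python
def wrap(v):
    """Assumed 2's complement overflow: fold v back into [-1, 1)."""
    return (v + 1.0) % 2.0 - 1.0


def run_filter(a, b, y0, y1, steps, f=wrap):
    """Iterate y[k+2] = f(a*y[k+1] + b*y[k]) with zero input."""
    ys = [y0, y1]
    for _ in range(steps):
        ys.append(f(a * ys[-1] + b * ys[-2]))
    return ys


# Region I example (a + b < -1): dc output y = 2 / (1 - a - b).
a1, b1 = -1.0, -0.5
y_dc = 2.0 / (1.0 - a1 - b1)
dc_run = run_filter(a1, b1, y_dc, y_dc, 50)

# Region II example (a - b > 1): period 2 output (-1)**k * 2 / (1 + a - b).
a2, b2 = 1.0, -0.5
y_p2 = 2.0 / (1.0 + a2 - b2)
p2_run = run_filter(a2, b2, y_p2, -y_p2, 50)

if __name__ == "__main__":
    print(dc_run[-3:])
    print(p2_run[-4:])
```

Both tap pairs are linearly stable, so without the overflow nonlinearity the same initial values would decay to zero.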


III. FURTHER ANALYSIS OF THE OSCILLATIONS 


To analyze equation (1) in greater detail, it is very convenient to 
write it in the form similar to (2), 


$$y(t + 2) - a\,y(t + 1) - b\,y(t) = \sum_n a_n\, u(t + 2 - n), \qquad (6)$$

where $u(t)$ is a square pulse of unit height that one may conveniently
think of as lasting from $t = 0$ until $t = 1$. This, of course, means that
one interprets the solution of (6) to be a piecewise constant function
like the actual output of the digital filter. For mathematical manipulations
it is sometimes desirable to also interpret (6) as a difference equation,
defined only for integer $t$. In this case one would write that $u(t - n) = \delta_{tn}$,
where $\delta_{tn}$ is the familiar Kronecker symbol.

Fig. 4 — Pole locations in the s-plane (shaded region) realizable under the constraint
that $|a| + |b| < 1$.

The point of the right side of (6) is simply to keep $|f(v)| < 1$ regardless
of what value $v$ has. From Fig. 2 we see that if $|v| < 1$, this
added term is not needed and we take $a_n = 0$. If $1 < v < 3$ then we take
$a_n = -2$, and if $-3 < v < -1$ we take $a_n = +2$. Since we have that
$|y(t)| < 1$ and that linear stability (see Fig. 3) implies $|a| < 2$, $|b| < 1$,
we need not consider further values of $|v|$. Thus in (6) $a_n = 0, \pm 2$
depending on whether or not $v(t) = a\,y(t + 1) + b\,y(t)$ crosses the lines
$v = \pm 1$. It will be convenient to have a word for such crossings; we
shall call them “clicks”, borrowing a favorite word from FM theory.
Then $a_n = 0$ when no click occurs and $a_n = \pm 2$ when one does.

Note that if one knew what the click sequence $\{a_n\}$ was, one could solve
(6) simply by using the clicks as the driving term for a linear equation.
We are mainly interested in describing the self-sustained steady state
oscillations of arbitrary period $N$. Hence initial conditions will play no
essential role for us, for while they determine which oscillating mode
appears as $t \to \infty$, they play no role in describing the modes. Our procedure
will be as follows:

(i) Assume a click sequence of period $N$: $a_0, a_1, a_2, \cdots, a_{N-1}$, with

$$a_{lN+k} = a_k, \qquad l = 0, 1, \cdots, \quad 0 \le k \le N - 1. \qquad (7)$$

(ii) Using the assumed $\{a_n\}$, find the steady state solution of (6).
However, only solutions that have $|y(t)| < 1$ for all $t$ are allowed.

(iii) Check that this steady state solution actually generates the assumed
click sequence.

In carrying out the above program for some simple cases we observed
that step (iii) never seemed to yield anything new. Indeed, surprising as
it seems at first glance, step (iii) never has to be carried out. If one obtains
a solution with $|y(t)| < 1$, this solution is consistent. That is, it automatically
generates the assumed click sequence. The proof is simple.
One calculates the argument of the function $f$ from (6):

$$a\,y(t + 1) + b\,y(t) = y(t + 2) - \sum_n a_n\, u(t + 2 - n). \qquad (8)$$

We have a click at time $t + 2 = m$ if $|a\,y(m - 1) + b\,y(m - 2)| > 1$.
From (8),

$$|a\,y(m - 1) + b\,y(m - 2)| = |y(m) - a_m|. \qquad (9)$$

Note then that if in (9) $a_m = 0$, then $|a\,y(m - 1) + b\,y(m - 2)| = |y(m)| < 1$;
thus if there is no click at a particular time in the assumed click
sequence the “solution” will not generate one. Next assume $a_m = +2$;
then

$$a\,y(m - 1) + b\,y(m - 2) = y(m) - 2 < -1, \qquad (10)$$

where we use $|y(t)| < 1$ again. Equation (10) says that if a positive click
is present in the assumed click sequence then the solution obtained from
the linear equation (6), given by this click sequence, will reproduce the
positive click. Obviously the same argument holds for a negative click,
$a_m = -2$, and the proof of this point is complete.
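
Steps (i) to (iii) can be exercised numerically. The sketch below is one possible arrangement, not the authors' procedure: it finds the steady state of the linear equation by simply running it until transients die out (which presumes linearly stable taps), then checks $|y| < 1$ and confirms that the wraparound adder reproduces the same outputs. The function names, the indexing convention `y[m] = a*y[m-1] + b*y[m-2] + clicks[m % N]`, and the wraparound form of $f$ are assumptions made here.

```python
def wrap(v):
    """Assumed 2's complement overflow: fold v back into [-1, 1)."""
    return (v + 1.0) % 2.0 - 1.0


def steady_state(a, b, clicks, n_periods=400):
    """Run y[m] = a*y[m-1] + b*y[m-2] + clicks[m % N] and return the last period.

    Taps close to the edge of the stability triangle would need more periods.
    """
    n = len(clicks)
    y1 = y2 = 0.0
    history = []
    for m in range(n_periods * n):
        y = a * y1 + b * y2 + clicks[m % n]
        y2, y1 = y1, y
        history.append(y)
    return history[-n:]


def is_consistent(a, b, clicks, tol=1e-9):
    """True if the assumed click sequence yields an allowed, self-reproducing output."""
    y = steady_state(a, b, clicks)
    n = len(y)
    if any(abs(v) >= 1.0 for v in y):
        return False  # step (ii) fails: the solution is not allowed
    # Step (iii): the nonlinear filter must regenerate the same outputs.
    return all(
        abs(wrap(a * y[(k - 1) % n] + b * y[(k - 2) % n]) - y[k]) < tol
        for k in range(n)
    )


if __name__ == "__main__":
    print(is_consistent(-1.0, -0.5, [2.0]))
    print(is_consistent(1.0, -0.5, [2.0, -2.0]))
    print(is_consistent(1.0, -0.5, [2.0, 0.0]))
```

In line with the argument above, the check in step (iii) never fails on its own in this sketch; a click sequence is rejected only through the magnitude test of step (ii).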

The steady-state solution of our fundamental equation (6) for an 
arbitrary click sequence {a,,} of period N is derived in the appendix. 
If we define 


$$A_{N-1}(z) = \sum_{n=0}^{N-1} a_n z^{-n} \qquad (11)$$
and

$$D(z) = z^2 - az - b, \qquad (12)$$


and let $r_j$, $j = 1, \cdots, N$, be the $N$ Nth roots of unity, then the (periodic)
output values are given by

$$y_k = \frac{1}{N} \sum_{j=1}^{N} \frac{A_{N-1}(r_j)}{D(r_j)}\, r_j^{\,k}. \qquad (13)$$


The above expression gives the $\{y_k\}$ output sequence for any click sequence.
We emphasize, however, that it is only a solution corresponding
to a self-sustained oscillation of the digital filter if we have $|y_k| < 1$ for
all $k$. Whether or not this is true depends on the particular click sequence
assumed.
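
A roots-of-unity solution of this kind can be sketched in a few lines of Python. The indexing below is an assumption made for the sketch: the periodic output satisfies `y[k] = a*y[k-1] + b*y[k-2] + clicks[k]`, so the time origin may differ from that of the paper's expression by a cyclic shift, which has no physical significance. The function name `periodic_output` is hypothetical.

```python
import cmath


def periodic_output(a, b, clicks):
    """Periodic solution of y[k] = a*y[k-1] + b*y[k-2] + clicks[k] (indices mod N)."""
    n = len(clicks)
    roots = [cmath.exp(2j * cmath.pi * j / n) for j in range(n)]
    ys = []
    for k in range(n):
        total = 0j
        for r in roots:
            a_of_r = sum(c * r ** (-m) for m, c in enumerate(clicks))
            d_of_r = r * r - a * r - b  # D(z) = z**2 - a*z - b evaluated at z = r
            total += a_of_r * r ** (k + 2) / d_of_r
        ys.append((total / n).real)
    return ys


if __name__ == "__main__":
    print(periodic_output(1.0, -0.5, [2.0, 0.0]))
```

The computed values still have to pass the test $|y_k| < 1$ before they describe a self-sustained oscillation.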

Another form of the solution can be obtained by manipulation of (13).
To write this down, define

$$b_n^{(k)} = (\bar{a}_{k-1-n} + \bar{a}_{k-1-n+N})/2, \qquad (14)$$

where we understand $\bar{a}_j = 0$ if $j$ does not lie between 0 and $N - 1$,
inclusive, and $\bar{a}_j = a_j$ if it does. One of the $\bar{a}$'s in (14) will thus always be
zero and $b_n^{(k)}$ has values of $\pm 1, 0$. The other form of the solution is then

$$y^{(k)} = \frac{2}{\rho_1 - \rho_2} \sum_{n=0}^{N-1} b_n^{(k)} \left[\frac{\rho_1^{\,n}}{1 - \rho_1^{\,N}} - \frac{\rho_2^{\,n}}{1 - \rho_2^{\,N}}\right], \qquad k = 0, 1, \cdots, N - 1 \qquad (15)$$

where $\rho_i$ are given in (3).

In (15) we have $N$ vectors of dimension $N$, namely the $\{b_n^{(k)}\}$, $k =
0, 1, 2, \cdots, N - 1$. Note from (14), however, that they are all cyclic
permutations of one another. Hence we may refer to the b vector, $\mathbf{b}$, of
a solution, understanding that $\mathbf{b}$ and all its cyclic permutations
generate a solution in the sense of (15). Note that a cyclic permutation
of the $y_k$ has no real significance here; it simply changes the origin of
time.

An interesting property of the solutions which we have written down
follows from the fact that if we transform the point $(a, b)$ in the ab-plane
into another point by

$$a \to a' = -a, \qquad b \to b' = b, \qquad (16a)$$

then under this transformation

$$\rho_1 \to \rho_1' = -\rho_2, \qquad \rho_2 \to \rho_2' = -\rho_1. \qquad (16b)$$

The property is this: Let $N$ be an even integer and let $\mathbf{b} = (b_0, b_1, \cdots,
b_{N-1})$ be a click vector generating a solution at point $(a, b)$. Then the
vector $\mathbf{b}' = (b_0, -b_1, b_2, -b_3, \cdots, -b_{N-1})$ generates a solution at the
reflected point $(-a, b)$. The proof is simple. Note from (15),

$$y'^{(k)} = \frac{2}{\rho_1' - \rho_2'} \sum_{n=0}^{N-1} b_n'^{(k)} \left[\frac{\rho_1'^{\,n}}{1 - \rho_1'^{\,N}} - \frac{\rho_2'^{\,n}}{1 - \rho_2'^{\,N}}\right]$$

$$= \frac{2}{\rho_1 - \rho_2} \sum_{n=0}^{N-1} (-1)^{k+n}\, b_n^{(k)}\, (-1)^n \left[\frac{\rho_1^{\,n}}{1 - \rho_1^{\,N}} - \frac{\rho_2^{\,n}}{1 - \rho_2^{\,N}}\right] = (-1)^k y^{(k)}.$$

Hence if $|y^{(k)}| < 1$ then $|y'^{(k)}| < 1$. Note that the proof also supplies
the value for $y'^{(k)}$ in terms of $y^{(k)}$. This theorem will be used later to
generate new solutions from old ones.

Before leaving this general discussion in favor of exhibiting some 
solutions in the next section, we list a few more observations related 
to the click vector $\mathbf{b}$. The click vector $\mathbf{b}$, whose only allowed component
values are $\pm 1, 0$, completely characterizes the associated oscillation.
Clearly there can then only be a finite number of oscillations of given
period $N$. This number is upper bounded by $3^N$, but will generally be
much less. Also note that a cyclic permutation of the components of $\mathbf{b}$
cyclically permutes the output values $y^{(k)}$, and this latter is merely a
shift in time. The permuted values are not physically distinct.

Also note that if we perform $\mathbf{b} \to -\mathbf{b}$ then $y \to -y$, and a solution
of opposite sign is obtained. While this may often be distinguishable
from the first solution, it is trivially related to it. Finally, if one were to
count the number of b vectors of dimension $N$ that yield new information,
one would wish to exclude subperiods of $N$. Thus if $(+, 0, 0)$ is a generating
b vector for period 3, $(+, 0, 0, +, 0, 0)$ generates a period 6
oscillation but this is not new information. We have not solved the problem
of counting how many of the $3^N$ vectors are left after we impose the
requirements of cyclic shifts, sign changes, and subperiods. At any rate,
it is essential to test the ones that remain to check that they generate
allowed solutions, $|y^{(k)}| < 1$.
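
For small $N$ the candidate click vectors can simply be enumerated by brute force. The sketch below is an illustration added here, not a solution of the counting problem: it lists one representative per class of vectors related by cyclic shifts and an overall sign change, skipping the all-zero vector and vectors with a subperiod. The function names are hypothetical.

```python
from itertools import product


def rotations(v):
    """All cyclic shifts of the tuple v."""
    return [v[i:] + v[:i] for i in range(len(v))]


def has_subperiod(v):
    """True if v repeats with a period that properly divides its length."""
    n = len(v)
    return any(n % d == 0 and v[d:] + v[:d] == v for d in range(1, n))


def distinct_click_vectors(n):
    """One representative b vector per class under cyclic shift and sign change."""
    seen, reps = set(), []
    for v in product((1, 0, -1), repeat=n):
        if not any(v) or has_subperiod(v):
            continue
        family = rotations(v) + rotations(tuple(-x for x in v))
        key = min(family)
        if key not in seen:
            seen.add(key)
            reps.append(max(family))  # representative with the largest leading entries
    return reps


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(n, distinct_click_vectors(n))
```

Each vector returned is only a candidate; it must still be tested for $|y^{(k)}| < 1$ at the tap values of interest.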


IV. SOME EXPLICIT PERIODS AND REGIONS OF OSCILLATION 


Now for a few explicit solutions. Consider the possibility of a dc
“oscillation”, namely, set $N = 1$. The only nontrivial click vector is
$\mathbf{b} = (+)$. The solution is more immediate if we use (13). We have

$$y = \frac{2}{1 - a - b} \qquad (17)$$

for the dc value of output. For what values of $a$ and $b$ within the triangle
of Fig. 3 will we have $|y| < 1$? We require

$$|1 - a - b| > 2 \qquad (18)$$

which is equivalent to either

$$1 - a - b > 2 \qquad (19a)$$

or

$$-1 + a + b > 2. \qquad (19b)$$

Inequality (19a) (coupled with the linear stability requirement) defines
the triangle labeled “I” in Fig. 3, while (19b) is outside the stability
region and needs no further consideration. Thus any portion of the
region $a < 0$ that we have not excluded from oscillations has now been
shown to have them. They are of period 1; other period oscillations may
(and do) occur in this region.
At this point it is amusing to use an earlier remark on the possibility
of generating new solutions from an even period one by “reflection”.
Letting $N = 2$, the click vector $\mathbf{b} = (+, +)$ certainly generates a period
2 oscillation (albeit one with subperiods) in region I. Then the click vector
$\mathbf{b} = (+, -)$ generates something really new: a period 2 oscillation
in the region labeled II in Fig. 3. The amplitudes of the output are

$$y_k = \frac{2(-1)^k}{1 + a - b}, \qquad k = 0, 1. \qquad (20)$$

One more possibility of a click vector exists for period 2, and that is
$\mathbf{b} = (+, 0)$. From (13) we write for possible output values

$$y_0 = \frac{1}{1 - a - b} + \frac{1}{1 + a - b}, \qquad y_1 = \frac{1}{1 - a - b} - \frac{1}{1 + a - b}. \qquad (21)$$

After a little uninteresting analysis one can conclude that we cannot
have $|y_0| < 1$, $|y_1| < 1$ in (21) for any allowed values of $a$ and $b$. Thus
there are no other period 2 oscillations.

On to period 3. Now there are four click vectors which must be considered.
These are $(+00)$, $(++0)$, $(+-0)$, $(++-)$. Even in this case
an exhaustive check that the “solutions” generated are legitimate ones is
trying. Therefore, we resort to a trick; we look for periods which may
exist in the immediate neighborhood of the point $(a = 0, b = -1)$. This
means $\rho_1 = i$, $\rho_2 = -i$. In this immediate neighborhood $\rho_2 = \rho_1^*$, and
(15) reads

$$y_k = \frac{2}{\operatorname{Im} z} \sum_{n=0}^{N-1} b_n^{(k)} \operatorname{Im}\left[\frac{z^n}{1 - z^N}\right] \qquad (22)$$

where we have let $z = \rho_1$. Letting $N = 3$, $z = i$ gives

$$y_0 = -b_2 + b_1 + b_0, \qquad y_1 = -b_0 + b_2 + b_1, \qquad y_2 = -b_1 + b_0 + b_2. \qquad (23)$$

We now require $|y_k| \le 1$ as a test for the click vector $\mathbf{b}$. We see that
only $(+00)$ qualifies as possibly yielding a solution in the neighborhood
of $(a = 0, b = -1)$. A computer study shows that indeed the solution
extends into the interior of the triangle and the region found is shown
in Fig. 5. This immediately implies existence of the period 6 oscillation
generated by $(+00-00)$ in the reflected region. Similarly, a period 5
oscillation region (with the concomitant period 10) generated by
$(+0000)$ is shown in Fig. 6.
Fig. 5 — A region for period 3 oscillations (axes: $a$ versus $b$).
It is very tempting to conjecture that the point $(a = 0, b = -1)$ is a
boundary point of any allowed region of oscillation. If this is true, a
procedure like that used above may eliminate some otherwise very
respectable b vectors from consideration. Note that for $N = 2$, $\mathbf{b} =
(+, 0)$ satisfies the required condition at $\rho_1 = i$, but we have shown this
is not extendable into the interior of the triangle. Hence existence at
$z = i$ does not guarantee an allowed solution.

Fig. 6 — A region for period 5 oscillations (axes: $a$ versus $b$).

Fig. 7 — Zeroing arithmetic, shown above, also gives rise to oscillations.


V. STABILITY WITH A MODIFIED ARITHMETIC 


In an attempt to eliminate these oscillations, proposals have been 
made which rely on detecting overflow. One such suggestion dictates that 
when overflow occurs, the adder is directed to shift out zero. For reference
we call this zeroing arithmetic. The effective transfer function of
the adder for zeroing arithmetic is given in Fig. 7. However, it can be 
shown by numerical example that such a procedure still leads to oscil- 
lations. Another possibility, “saturation arithmetic,” is displayed in 
Fig. 8. Here a one (with the appropriate sign) is put out when overflow 
is detected. The remaining portion of this paper is devoted to proving 
that saturation arithmetic leads to stable operation whenever linear 
theory would predict it to be so. 
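
The contrast between wraparound and saturation can be seen in a short simulation. The sketch below starts both arithmetics from the dc oscillation state of a region I filter; the tap values, the initial state, and the function names are illustrative assumptions, and the wraparound form of the 2's complement adder is assumed as before.

```python
def wrap(v):
    """Assumed 2's complement overflow: fold v back into [-1, 1)."""
    return (v + 1.0) % 2.0 - 1.0


def saturate(v):
    """Saturation arithmetic: clip the adder output to +1 or -1 on overflow."""
    return max(-1.0, min(1.0, v))


def run_filter(a, b, y0, y1, steps, f):
    """Iterate y[k+2] = f(a*y[k+1] + b*y[k]) with zero input."""
    ys = [y0, y1]
    for _ in range(steps):
        ys.append(f(a * ys[-1] + b * ys[-2]))
    return ys


a, b = -1.0, -0.5  # linearly stable, but |a| + |b| > 1
wrapped = run_filter(a, b, 0.8, 0.8, 200, wrap)
saturated = run_filter(a, b, 0.8, 0.8, 200, saturate)

if __name__ == "__main__":
    print("wraparound, last output:", wrapped[-1])
    print("saturation, last output:", saturated[-1])
```

With these values the wraparound adder holds the output at 0.8 indefinitely, while the saturating adder clips once and then lets the output decay as linear theory predicts.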

To begin, we suppose for the moment that we ignore the fact that 
the digitally implemented adder is nonlinear. Then the second-order 
linear difference equation which governs the behavior of the undriven 
system has solutions y, which may be described as follows: 


Case 1: Complex roots for characteristic equation

$$y_k = \operatorname{Re} K_0 \exp(-\alpha k), \quad K_0 \text{ and } \alpha \text{ complex}, \quad \operatorname{Re} \alpha > 0, \quad k = 0, 1, 2, \cdots. \qquad (24)$$

Case 2: Real but unequal roots

$$y_k = K_1 \exp(-\alpha k) + K_2 \exp(-\beta k), \quad K_i \text{ real}; \quad \alpha > 0, \ \beta > 0. \qquad (25)$$

Case 3: Real and equal roots

$$y_k = [K_1 + K_2 k] \exp(-\alpha k), \quad K_i \text{ real}; \quad \alpha > 0. \qquad (26)$$

Using this information, coupled with knowledge of $y_j$ and $y_{j+1}$ for some
$j$, it is easy to give a bound on the magnitudes of all future ($k \ge j$)
values of the output and to show this value goes to zero with increasing
$j$. This is just another way to say that the solutions go to zero for the
linear case. In the nonlinear case we cannot exclude the situation that
some $y_{j+1}$ will exceed unity and the nonlinearity will be operative. For
saturation arithmetic the offending value must be set to unity if, for
example, $y_{j+1} > +1$. We can, for conceptual purposes, regard this as a
“squeezing” of the output from a value greater than unity down to the
value one which is performed in a continuous fashion. The crux of the
proof now comes in showing that the partial derivative of our bound
(on future outputs) with respect to the most recent output $y_{j+1}$ has, for
saturation arithmetic, the same sign as $y_{j+1}$. Hence decreasing a value
that is too large in magnitude will decrease the bound as well, and it
will go to zero at least as fast as it does for the linear case.

To show how the above outline works, consider first the linear case 
with complex roots. From the form of the solution 


$y_k = \mathrm{Re}\, K_0 \exp(-\alpha k)$, $\mathrm{Re}\,\alpha > 0$, $k = 0, 1, 2, \ldots$,

it is clear that if we define

$B_0 = |K_0|^2$ (27)

then $y_k^2 \leq B_0$ for all $k \geq 0$. We now express $B_0$ in terms of the values
$y_0$, $y_1$ which are initially stored in the shift registers to yield

$B_0 = y_0^2 + \dfrac{[y_1 - y_0\, \mathrm{Re} \exp(-\alpha)]^2}{[\mathrm{Im} \exp(-\alpha)]^2}$. (28)

This suggests that one define the more general set of numbers

$B_j = y_j^2 + \dfrac{[y_{j+1} - y_j\, \mathrm{Re} \exp(-\alpha)]^2}{[\mathrm{Im} \exp(-\alpha)]^2}$. (29)

Clearly, from the way that $B_j$ is defined, we have that

$y_k = \mathrm{Re}\, K_j \exp[-\alpha(k - j)]$, $k \geq j$ (30)

where $K_j$ is some appropriate complex number that satisfies

$B_j = |K_j|^2$. (31)

From (30), the additional inequality that $y_k^2 \leq B_j$ for all $k \geq j$ follows.
Furthermore, one can see by comparing (30) and (24) that

$|K_j|^2 = |K_0|^2\, |\exp(-\alpha j)|^2$. (32)

Hence, since the real part of $\alpha$ is positive, $B_j$ goes monotonically to
zero with increasing j.
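The relations (27) through (32) can be checked numerically. The following is a minimal sketch, assuming illustrative values for $\alpha$ and $K_0$ (the names `ALPHA`, `K0`, `bound`, and `y` are hypothetical and not from the paper); it evaluates the bound of (29) along a linear solution of the form (24).

```python
import cmath

def bound(y_j, y_next, alpha):
    """B_j of (29), computed from two consecutive outputs y_j and y_{j+1}."""
    e = cmath.exp(-alpha)
    return y_j ** 2 + (y_next - y_j * e.real) ** 2 / e.imag ** 2

ALPHA = complex(0.05, 0.3)   # assumed value, chosen only so that Re(alpha) > 0
K0 = complex(0.7, -0.4)      # assumed initial complex amplitude

def y(k):
    """Linear solution (24): y_k = Re K_0 exp(-alpha k)."""
    return (K0 * cmath.exp(-ALPHA * k)).real

# Bounds B_0, B_1, ... evaluated from consecutive output pairs.
B = [bound(y(j), y(j + 1), ALPHA) for j in range(50)]
```

Under these assumptions `B[0]` should agree with `abs(K0) ** 2`, and each `B[j]` should shrink by the factor `abs(cmath.exp(-ALPHA)) ** 2` per step, as (32) states.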

To generalize the above arguments to a nonlinear situation of interest,*
consider the following equation which follows from (29):

$\dfrac{\partial B_j}{\partial y_{j+1}} = \dfrac{2}{[\mathrm{Im} \exp(-\alpha)]^2}\, [y_{j+1} - y_j\, \mathrm{Re} \exp(-\alpha)]$. (33)

Now imagine $B_{j-1}$ has been calculated from values stored in the registers.
From linear theory we predict $y_{j+1}^{(L)}$ and $B_j^{(L)} \leq B_{j-1} \exp(-2\alpha)$, by (32).
Now if the $y_{j+1}^{(L)}$ generated by the linear equation were too large, say, then
decreasing it to unity would, according to (33), decrease the bound $B_j$ if
we knew that

$y_{j+1} - y_j\, \mathrm{Re}[\exp(-\alpha)] \geq 0$ for $y_{j+1}^{(L)} \geq y_{j+1} \geq y_{j+1}^{(C)}$ (34)

where $y_{j+1}^{(L)}$ is the linear prediction for $y_{j+1}$ and $y_{j+1}^{(C)}$ is the correct value
for the nonlinear circuit resulting from "squeezing" $y_{j+1}^{(L)}$ down. Since
$|y_j| \leq 1$ and $\mathrm{Re} \exp(-\alpha) < 1$, (34) is always true for saturation
arithmetic (see Fig. 8) because $y_{j+1}^{(C)} = +1$ (assuming $y_{j+1}^{(L)} > +1$) and
(34) can never swing negative. Similar things happen, of course, if
$y_{j+1}^{(L)} < -1$. Thus the bound decreases at least as fast as for the linear
case (which is exponential) and stability is assured. For zeroing
arithmetic $y_{j+1}^{(C)} = 0$, and thus the appropriate sign for (34) cannot be
guaranteed, which is in satisfying agreement with the known instability for
this case.
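The argument above can be illustrated by simulating the undriven second-order filter with a saturating adder. This is only a sketch under stated assumptions: the filter is written as $y_{k+2} = f(a y_{k+1} + b y_k)$ with complex characteristic roots $r e^{\pm i\theta}$, and the function names and the parameter values in the usage note are hypothetical choices made for illustration.

```python
import math

def saturate(v):
    """Saturation arithmetic: clip the adder output to [-1, 1]."""
    return max(-1.0, min(1.0, v))

def simulate(r, theta, y0, y1, steps, f=saturate):
    """Run y[k+2] = f(a*y[k+1] + b*y[k]) and track the bound of (29)."""
    c = r * math.cos(theta)          # Re exp(-alpha)
    d = r * math.sin(theta)          # Im exp(-alpha), up to sign
    a, b = 2.0 * c, -r * r           # roots of z^2 - a z - b are c +/- i d

    def bound(yj, yj1):
        return yj ** 2 + (yj1 - yj * c) ** 2 / d ** 2

    ys = [y0, y1]
    bounds = [bound(y0, y1)]
    overflows = 0
    for _ in range(steps):
        linear = a * ys[-1] + b * ys[-2]     # linear prediction
        if abs(linear) > 1.0:
            overflows += 1                   # the nonlinearity is operative
        ys.append(f(linear))
        bounds.append(bound(ys[-2], ys[-1]))
    return ys, bounds, overflows
```

Calling, for example, `simulate(0.95, 0.2, 0.2, 1.0, 400)` starts from a state whose first linear prediction exceeds unity; the returned `bounds` should then shrink by at least the linear factor $r^2$ at every step, in line with the claim that saturation never slows the decay.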

For the next case of real but unequal roots, we now have reference to
equation (25) and define our initial bound as

$B_0 = 2(K_1^2 + K_2^2) = \dfrac{2\{[y_1 - \exp(-\alpha)\, y_0]^2 + [y_1 - \exp(-\beta)\, y_0]^2\}}{[\exp(-\alpha) - \exp(-\beta)]^2}$. (35)


The remaining details are too similar to those of the preceding case to 
warrant recording again; stability for saturation arithmetic holds here 
as well. 

The last case to discuss occurs when we have real and equal roots. 





* $B_j$ calculated from (29) is a bound on future outputs for the nonlinear as well as
the linear case. If $B_j < 1$ the two cases coincide, while if $B_j > 1$ the conclusion
follows equally trivially since $|y_k| \leq 1$ for the nonlinear situation.
Fig. 8 — The above nonlinearity corresponds to saturation arithmetic and leads to
stable behavior. (Vertical axis: output = f(v).)


This situation, represented for the linear equation by equation (26),
is more difficult to treat than the previous ones. The analog of (27) and
(35) now is

$B_0 = \max\{4K_1^2,\ 4K_2^2/\alpha^2\}$. (36)

That (36) yields a bound follows from the facts that (for $t \geq 0$)

$y_k^2 \leq \max_t\, [(K_1 + K_2 t) \exp(-\alpha t)]^2$

$\leq 2 \max_t\, [K_1^2 + K_2^2 t^2] \exp(-2\alpha t)$

$\leq 4 \max\{\max_t K_1^2 \exp(-2\alpha t),\ \max_t K_2^2 t^2 \exp(-2\alpha t)\}$

$= 4 \max\{K_1^2,\ (K_2^2/\alpha^2) \exp(-2)\}$

$\leq 4 \max\{K_1^2,\ K_2^2/\alpha^2\}$.
Since

$K_1^2 = y_0^2$, $\quad K_2^2/\alpha^2 = (y_1 \exp\alpha - y_0)^2/\alpha^2$, (37)

we define our general bound as

$B_j = 4 \max\{y_j^2,\ (y_{j+1} \exp\alpha - y_j)^2/\alpha^2\}$. (38)

Using the solution $y_j = (K_1 + K_2 j) \exp(-\alpha j)$, we see that

$\theta_j = (y_{j+1} \exp\alpha - y_j)^2/\alpha^2$ (39)

decreases by the multiplicative factor $\exp(-2\alpha)$ for every unit increase
of j. Further, suppose that $B_j = 4y_j^2$ for some j. That is, suppose

$(y_{j+1} \exp\alpha - y_j)^2/\alpha^2 \leq y_j^2$. (40)

This implies

$y_{j+1}^2 \leq y_j^2\, (1 + \alpha)^2 \exp(-2\alpha)$, (41)

and so if next time $B_{j+1} = 4y_{j+1}^2$, then we have decreased by
$(1 + \alpha)^2 \exp(-2\alpha) < 1$. On the other hand, if at the next step we have
to choose $B_{j+1} = 4\theta_{j+1}$, we see

$\dfrac{B_{j+1}}{B_j} = \dfrac{\theta_{j+1}}{y_j^2} \leq \dfrac{\theta_{j+1}}{\theta_j} = \exp(-2\alpha)$. (42)





Likewise if we go from $4\theta_j$ to $4\theta_{j+1}$ we decrease by $\exp(-2\alpha)$. Finally,
a "transition" from $4\theta_j$ as a bound to $4y_{j+1}^2$ decreases the bound by a
multiplicative factor of $(1 + \alpha)^2 \exp(-2\alpha)$. To see this we note that,
by assumption,

$B_j = 4\, (y_{j+1} \exp\alpha - y_j)^2/\alpha^2 \geq 4y_j^2$. (43)

Using the left-hand equality in (43) implies

$|y_{j+1}| \exp\alpha \leq \dfrac{\alpha (B_j)^{1/2}}{2} + |y_j|$, (44)

while $B_j \geq 4y_j^2$ yields

$|y_j| \leq \dfrac{(B_j)^{1/2}}{2}$. (45)

Using (45) in (44) then allows us to deduce that

$B_{j+1} = 4y_{j+1}^2 \leq (1 + \alpha)^2 \exp(-2\alpha)\, B_j$ (46)

as was claimed. To extend these arguments to the nonlinear case we
again observe that

$\dfrac{\partial B_j}{\partial y_{j+1}} \geq 0$ (47)

for saturation arithmetic.
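The four transitions discussed for the equal-root case all reduce the bound by at most the factor $(1 + \alpha)^2 \exp(-2\alpha)$. A short numerical sketch of this for the linear solution (26) follows; the function name and the sample constants in the usage note are assumptions made only for illustration.

```python
import math

def equal_root_bounds(K1, K2, alpha, n):
    """Outputs y_j of (26), the quantity theta_j of (39), and the bound B_j of (38)."""
    y = [(K1 + K2 * j) * math.exp(-alpha * j) for j in range(n + 1)]
    theta = [(y[j + 1] * math.exp(alpha) - y[j]) ** 2 / alpha ** 2
             for j in range(n)]
    B = [4.0 * max(y[j] ** 2, theta[j]) for j in range(n)]
    return y, theta, B
```

With, say, `equal_root_bounds(0.5, -0.8, 0.3, 40)`, consecutive values of `B` should never decrease by less than the factor `(1 + alpha) ** 2 * math.exp(-2 * alpha)`, and `theta` should decay by exactly `math.exp(-2 * alpha)` per step up to rounding.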


VI. GENERALIZATIONS TO OTHER STABLE NONLINEARITIES 


Aside from the three nonlinearities already mentioned, there does 
not appear to be immediate engineering interest in seeing which other 
nonlinearities will or will not give rise to stable behavior of the filter. 
Having come this far, however, it is hard to resist asking if the method 
of proof we have used, or some slight extension of it, does suggest other 
nonlinearities for which stability will hold. The extension we consider
is not to require

$\dfrac{\partial B_j}{\partial y_{j+1}} \geq 0$

all during the "squeezing" operation, but merely that

$B^{(L)} - B^{(C)} \geq 0$, (48)

where $B^{(L)}$ is the value of the bound using linear theory and $B^{(C)}$ is the
"correct" value. An inspection of the previous proofs shows that this
is equivalent to

$(y_{k+1}^{(L)} - a y_k)^2 - (y_{k+1}^{(C)} - a y_k)^2 \geq 0$ (49)

for all real a such that $|a| < 1$.
A little manipulation reduces (49) to

$(y_{k+1}^{(L)} - y_{k+1}^{(C)})\,(y_{k+1}^{(L)} + y_{k+1}^{(C)} - 2a y_k) \geq 0$. (50)

Assuming $y_{k+1}^{(L)} > 0$, the first term in (50) to be nonnegative, and
$|y_k| \leq 1$, makes it apparent that

$y_{k+1}^{(L)} + y_{k+1}^{(C)} \geq 2$ (51)

is sufficient. The "stable nonlinearities" deduced from this kind of
reasoning are outlined in Fig. 9. Thus any nonlinearity whose graph
coincides with the identity function on the interval [−1, 1] and whose
remaining portions lie in the closed shaded region of Fig. 9 will be stable.
The function in these regions need not be continuous and need not obey
$f(-v) = -f(v)$.
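Inequality (49) and the sufficient condition (51) are easy to test for a given overflow rule. The sketch below is illustrative only: the three overflow functions are written from their usual definitions (the two's-complement wrap in particular is an assumption about which third nonlinearity the paper has in mind), and all function names are hypothetical.

```python
def satisfies_49(y_lin, y_corr, a, y_prev):
    """Inequality (49) for one choice of a and of the previous output y_k."""
    return (y_lin - a * y_prev) ** 2 - (y_corr - a * y_prev) ** 2 >= 0.0

def sufficient_51(y_lin, y_corr):
    """For y_lin > 1: nonnegative first factor of (50), and condition (51)."""
    return y_corr <= y_lin and y_lin + y_corr >= 2.0

def saturation(v):
    return max(-1.0, min(1.0, v))

def zeroing(v):
    return v if abs(v) <= 1.0 else 0.0

def twos_complement(v):
    """Wrap-around overflow into [-1, 1); assumed form of the third rule."""
    return ((v + 1.0) % 2.0) - 1.0
```

For a linear prediction of 1.5, saturation gives 1.0 and satisfies (51), whereas zeroing gives 0.0 and the wrap-around rule gives −0.5; both of the latter violate (49) for a = 0.9 and a previous output of 1.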

An even higher degree of generality is achieved when we realize that
nothing in our proofs required the nonlinearity f(v) to be the same for
successive values of the parameter k. This is tantamount to allowing the
nonlinearity to be random in the following manner. Suppose a value of
$y_{k+1}^{(L)} > 1$ has been predicted from linear theory (see Fig. 9). The
perpendicular P to the v axis through $y_{k+1}^{(L)}$ intersects the shaded region
shown in Fig. 9 along a line segment. Choose randomly from this line
segment the "value" of the nonlinearity to give $y_{k+1}$. The discussion in
this Section shows that the solutions of the difference equation

$y_{k+2} = f[a y_{k+1} + b y_k]$ (52)

which has the stochastic nonlinearity just described will be stable
whenever the linear version has stable solutions.


APPENDIX 


Derivation of the Steady-State Solution 


We obtain the steady-state solution of our fundamental equation (6)
using z-transforms. Recall that if one has a bounded sequence of numbers
$\{a_n\}$, the z-transform is defined by

$f(z) = \sum_{n=0}^{\infty} a_n z^{-n}$ (53)

where (53) converges and is analytic outside the unit circle, $|z| > 1$.
It is easy to show that if $\{a_n\}$ is periodic of period N, that is if $a_{N+n} = a_n$,
then (53) becomes

$f(z) = \dfrac{A_{N-1}(1/z)}{1 - z^{-N}}$ (54)

where $A_{N-1}$ is the polynomial of degree (N − 1) in 1/z given by

$A_{N-1}(1/z) = \sum_{n=0}^{N-1} a_n z^{-n}$. (55)

The N poles of f(z) at the N roots of unity are apparent from (12), and
there are no other poles.
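The closed form (54) for a periodic sequence can be compared against a truncated version of the defining series (53). The following is a small sketch; the function names and the sample period in the usage note are assumptions made for illustration.

```python
def z_transform_truncated(a_period, z, periods):
    """Partial sum of (53) over the given number of periods (requires |z| > 1)."""
    N = len(a_period)
    return sum(a_period[n % N] * z ** (-n) for n in range(N * periods))

def z_transform_closed(a_period, z):
    """Closed form (54): A_{N-1}(1/z) / (1 - z^{-N})."""
    N = len(a_period)
    A = sum(a_n * z ** (-n) for n, a_n in enumerate(a_period))
    return A / (1 - z ** (-N))
```

For a period such as `[2, 0, -2, 0, 2]` and a point like `z = complex(0.9, 0.8)` outside the unit circle, the two functions should agree to rounding error once enough periods are summed.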
Fig. 9 — Any nonlinearity whose graph coincides with the identity function on the
interval [−1, +1] and whose remaining portions lie in the (closed) shaded region will
be stable. The possibility of generalizing this to a stochastic nonlinearity is also noted
in the text. (Axes: f(v) versus v.)

Denoting by Y(z) the z-transform of y(t) excluding the additive terms 
involving initial conditions (since these will damp out because of linear 
stability) we have from (6) that 


Y@) = (56) 


@ —az— db —2%) 
The z-transform of the steady-state solution must still be extracted
from Y(z). Since the unit circle $|z| = 1$ corresponds to the
frequency axis if one were using Fourier transforms, we know, by
analogy, the steady-state portion of (56) will be the pole-terms. Let
$r_i$, $i = 1, \ldots, N$ be the N Nth roots of unity and define
i 1) = N-1 (Ayr = 1 —2z -N 
Q; (1 ors dX r; 2 1 a 1 ‘ (57) 
r, @ 


Note (57) implies 


of(2) ae (58) 


(59) 
where we have let

$D(z) = z^2 - az - b$. (60)
Using (57) once more, the steady-state solution (59) may be written 


ee aaa ( Jar (2) 


eget N Ge r:D(,) 








Y(2) = (61) 


Referring back to the discussion at the beginning of this section, we see 
that (61) is the z-transform of a sequence {y,} of period N where 


is , Ar (\or(4)| 





‘ : r; 2 
y, = coefficient of z~* in Fao Qu > Der) { 
k=0,1,---,N-—1. (62) 
Using (57) in (62) we obtain 
a), 
1 N Ay-1 a 


where, in writing (63), we have used the fact that $r_i^N = 1$. Expression
(63) thus gives the $\{y_k\}$ sequence for any click sequence. It is a solution
corresponding to a self-sustained oscillation of the digital filter only if
we have $|y_k| \leq 1$, all k.

Two sums appear in (63). The explicit one shown is the sum over the
roots of unity; the hidden one is the polynomial $A_{N-1}(1/r_i)$. We will
exhibit another form of solution (63) by explicitly doing the sum over the
N roots. We begin by writing

$A_{N-1}(1/r_i) = 2 \sum_{l=0}^{N-1} p_l\, r_i^{-l}$, $\quad p_l = \pm 1, 0$. (64)

Thus $p_l$ are the coefficients, except for the factor of 2, of the polynomial
$A_{N-1}(z)$. We also write, by factoring D(z) and expanding in partial
fractions,

$\dfrac{1}{D(z)} = \dfrac{1}{(z - p_1)(z - p_2)} = \dfrac{1}{p_1 - p_2} \left[ \dfrac{1}{z - p_1} - \dfrac{1}{z - p_2} \right]$. (65)


Now note that if z is such a number that $z^N = 1$, we have (since $|p| < 1$
and $|z| = 1$)

$\dfrac{1}{z - p} = \dfrac{1}{z} \sum_{n=0}^{\infty} \left( \dfrac{p}{z} \right)^n$. (66)
Let us look at the sum of the n = 0, N, 2N, etc., terms in the right side
of (66), that is

$1 + \dfrac{p^N}{z^N} + \dfrac{p^{2N}}{z^{2N}} + \dfrac{p^{3N}}{z^{3N}} + \cdots = 1 + p^N + p^{2N} + p^{3N} + \cdots = \dfrac{1}{1 - p^N}$. (67)
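Grouping the terms of the series (66) by the residue of n modulo N, each group sums as in (67), which turns the infinite series into a finite sum of N terms whenever $z^N = 1$. The sketch below checks that finite form numerically; it is written directly from the geometric series and the function names are hypothetical.

```python
import cmath

def inverse_direct(z, p):
    """Left side of (66): 1 / (z - p)."""
    return 1.0 / (z - p)

def inverse_finite_sum(z, p, N):
    """Finite form valid when z**N == 1: sum_{n<N} p^n / z^(n+1), over (1 - p^N)."""
    return sum(p ** n / z ** (n + 1) for n in range(N)) / (1 - p ** N)
```

Evaluating both functions at each Nth root of unity, for any p inside the unit circle, should give the same value up to rounding.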


Treating the sum of terms

$n = 1,\ N + 1,\ 2N + 1,\ \ldots$
$n = 2,\ N + 2,\ 2N + 2,\ \ldots$
$\vdots$
$n = N - 1,\ N + (N - 1),\ 2N + (N - 1),\ \ldots$

similarly, we have











: <1) fete heen etal: (68) 
oe ee on ee Ze gN7 
Finally letting z = 1/r; gives 
1 ro “ 
pon = ts Elon (69) 
Oe 1 — p n=0 
te 


Using (65) and (64) in (63) yields 


1 2 N-1 
Un = “77 SxS) 


Pi — po N i t=0 1; 


[253 ( pi yas p> ie (70) 
Pe foots AD = py Lo) pa 
Two sums in (70) are immediately done. First look at the sum over the
roots of unity. This involves observing that

$\sum_{i=1}^{N} r_i^{\,k-1-l-n} = N$ if $k - 1 - l - n \equiv 0 \pmod N$, and $0$ otherwise. (71)
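The orthogonality property behind (71), namely that integer powers of the N roots of unity sum to N when the exponent is a multiple of N and to zero otherwise, can be confirmed with a few lines. This is a generic check, not taken from the paper, and `root_power_sum` is a hypothetical name.

```python
import cmath

def root_power_sum(m, N):
    """Sum over the N Nth roots of unity of each root raised to the integer power m."""
    return sum(cmath.exp(2j * cmath.pi * i * m / N) for i in range(N))
```

Here `m` plays the role of the exponent k − 1 − l − n in (71).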


The congruence indicated in (71) can only be satisfied here if $l = k - 1 - n$
or if $l = k - 1 - n + N$. Thus it is useful to define

$2 b_n^{(k)} = \bar{a}_{k-1-n} + \bar{a}_{k-1-n+N}$, (72)

where we understand $\bar{a}_j = 0$ if j does not lie between 0 and N − 1,
inclusive, and $\bar{a}_j = a_j$ if it does. One of the $\bar{a}$'s in (72) will thus always be
zero and $b_n^{(k)}$ has values, like the p's, of ±1, 0. Using the discussion above
surrounding equations (71) and (72) we perform next the sum over l
and write another form of the solution:

$y_k = \dfrac{2}{p_1 - p_2} \sum_{n=0}^{N-1} b_n^{(k)} \left[ \dfrac{p_1^n}{1 - p_1^N} - \dfrac{p_2^n}{1 - p_2^N} \right]$, $\quad k = 0, 1, \ldots, N - 1$. (73)
Rate Optimization for Digital
Frequency Modulation


By J. E. MAZO, HARRISON E. ROWE, and J. SALZ 
(Manuscript received June 12, 1969) 


The data rate of a multilevel digital FM system is optimized subject 
to fixed RF bandwidth, signal-to-noise ratio, and output error rate. The 
possibility of optimizing such a system was first considered by J. R. 
Pierce at Bell Telephone Laboratories. He made the observation that it
is possible to send many levels slowly or fewer levels rapidly for an FM
wave of fixed RF bandwidth and error rate, and that there must be a choice 
of signaling rate and number of levels that optimize the data rate. The 
rigorous treatment of this problem is the subject of this paper. The mathe- 
matical model we analyze uses frequency-shift keying at the transmitter 
and ideal discrimination detection with an integrate-and-dump circuit as 
the post-detection filter. Our results are exhibited graphically showing the 
various dependencies among the pertinent system parameters. 


I. INTRODUCTION 


In this paper we optimize the information rate (subject to certain 
constraints) of a multilevel digital FM system. This problem of 
delivering the maximum information through an FM system has 
recently been formulated by J. R. Pierce.’ Specifically, he considered 
how one should choose the baseband signaling rate and the number 
of levels to get the most information through the channel, subject to 
fixed bandwidth, fixed RF signal-to-noise ratio, and fixed output error 
rate. This optimization has recently been carried out under the assump- 
tion that the conventional FM receiver can be linearized.” Small-noise 
linear FM theory is satisfactory when analyzing analog systems, but 
has its well known pitfalls in digital applications. 

The purpose of this paper is to reexamine this problem more rig- 
orously, paying particular attention to the anomalies (clicks) which 
can result from the nonlinear character of the receiver. In order to 
do this we must choose a particular mathematical model for digital 
FM which is amenable to analysis. Such a model uses frequency-shift
keying (FSK) at the transmitter and ideal discrimination detection
with an integrate-and-dump circuit as the postdetection filter. The
noise at RF is assumed to possess gaussian statistics. Although realizable
FM systems do not exactly conform to this ideal mathematical model,
we feel that the results predicted with the use of this model are
applicable to real FM systems. In any case, the numerical results agree
well with those derived from the linear theory. According to our present
calculations, this is due to the circumstance that the optimum number
of levels leads to small enough deviations so that the contribution
of the clicks to the error rate can be neglected.


II. ANALYSIS 


Consider an n-level FSK communication system with a sample rate
N = 1/T, square-wave modulation, and a level separation (in frequency)
$\Delta f$. Such a system would yield a data rate R given by

$R = N \log_2 n = 1.443\, N \ln n$ bits/s, (1)

and, according to Carson's rule, occupy a bandwidth*

$B = N + (n - 1)\, \Delta f$. (2)

The FM signal plus gaussian noise enters a receiver consisting of an
ideal RF filter (bandwidth B), limiter, discriminator, integrator
(integration time T), and sampler (sampling rate N). The sampler
outputs are simply the successive values of the instantaneous phase of
the modulated wave following each (rectangular) modulation pulse,
and would be separated by multiples of

$\Delta\phi = 2\pi\, \dfrac{\Delta f}{N}$ radians (3)

in the absence of noise.

The simplicity of the present system (that is, the finite-time integrator
post-detection filter) has permitted a fairly rigorous determination of
the probability of error for high RF signal-to-noise ratio.⁴ It is shown
in Ref. 4 that the parameter $\Delta\phi$ given in equation (3) plays a very
important role in the theory of error rates for digital FM. In particular,
it is known that if $\Delta\phi < \pi$ (or equivalently, $\Delta f/N < \tfrac{1}{2}$), then it is
the smooth noise at the baseband output which determines the error
rate; while if $\Delta\phi > \pi$ ($\Delta f/N > \tfrac{1}{2}$), then the clicks dominate, which
is the basic reason for the probability of error taking on different forms
in these two cases.

* Comparison with the exact FSK spectra for n = 2, 4, 8 suggests that this
approximation is valid for present purposes.³

The optimum systems considered here are shown to correspond to
the $\Delta f/N < \tfrac{1}{2}$ case, for which clicks are unimportant. Therefore we
take the probability of error* P as given by twice equation (17a) of
Ref. 4, with $\phi \to \Delta\phi/2 = \pi\, \Delta f/N$;


,(ZAl 
P maa exp | - 2 sin” (z 4) | ; 


p> |, ” < ‘ ; 
and subsequently verify that $\Delta f/N$ is indeed less than $\tfrac{1}{2}$ for the
resulting optimum systems. Here p is the RF signal-to-noise ratio in
the frequency band B. We treat the asymptotic approximation (for
large p) of equation (4) as an equality in the following.

For fixed error rate P and RF signal-to-noise ratio p, equation (4)
determines $\Delta f/N$. Rewriting equation (2),

$\dfrac{B}{N} = 1 + (n - 1) \dfrac{\Delta f}{N}$; (5)

substituting equation (5) into equation (1),

$\dfrac{R}{B} = \dfrac{1.443 \ln n}{1 + (n - 1)\, \Delta f/N}$ bits/cycle. (6)

We set the derivative of equation (6) equal to zero, determining the
optimum number of levels $n_0$ and maximum rate $R_0$.

$n_0 (\ln n_0 - 1) = \dfrac{N}{\Delta f} - 1$. (7)

$\dfrac{R_0}{B} = \dfrac{1.443}{n_0\, (\Delta f/N)}$. (8)

Alternatively, once the optimum number of levels $n_0$ has been
determined via equations (4) and (7), we may express the other parameters
of the (optimum) system in terms of $n_0$ only:

$\dfrac{\Delta f}{N} = \dfrac{1}{n_0 (\ln n_0 - 1) + 1}$, (9)

$\dfrac{R_0}{B} = 1.443 \left[ \ln n_0 - 1 + \dfrac{1}{n_0} \right]$ bits/cycle, (10)

$\dfrac{B}{N} = \dfrac{n_0 \ln n_0}{n_0 (\ln n_0 - 1) + 1}$. (11)

Note that the restriction $\Delta f/N < \tfrac{1}{2}$ implies via equation (7) that

$n_0 \geq 4$. (12)

Finally, the Shannon capacity for the RF channel is

$\dfrac{C}{B} = 1.443 \ln (1 + p)$ bits/cycle. (13)

* For multilevel output samples, most errors will be to adjacent levels. Assuming
that something like the Gray code is used, the symbol probability of error P of
equation (4) will be approximately the bit probability of error for the final
reconstructed binary signal.
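Equations (9) to (13) are simple enough to tabulate directly. The following sketch evaluates them as functions of $n_0$ and of the signal-to-noise ratio; the function names are hypothetical, and 1.443 is written as 1/ln 2.

```python
import math

LOG2E = 1.0 / math.log(2.0)   # the constant 1.443 of equations (1), (10), (13)

def df_over_N(n0):
    """Equation (9): level separation relative to the sample rate."""
    return 1.0 / (n0 * (math.log(n0) - 1.0) + 1.0)

def max_rate(n0):
    """Equation (10): maximum rate R_0/B in bits/cycle."""
    return LOG2E * (math.log(n0) - 1.0 + 1.0 / n0)

def bandwidth_expansion(n0):
    """Equation (11): B/N for the optimum system."""
    return n0 * math.log(n0) / (n0 * (math.log(n0) - 1.0) + 1.0)

def shannon_capacity(p):
    """Equation (13): C/B in bits/cycle for RF signal-to-noise ratio p."""
    return LOG2E * math.log(1.0 + p)
```

Evaluating `df_over_N` at 3 and at 4 shows why (12) sets the lower limit at four levels: the value is above one half for three levels and below it for four.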


III. RESULTS 


Figures 1 to 7 illustrate the parameters of optimum multilevel 
FM systems using a finite-time integrator as a post-detection filter 
for two representative error rates (P = 107°, 10°). 

The solid curves of Fig. 1 show the optimum number of levels $n_0$
versus the RF signal-to-noise ratio in dB, $10 \log_{10} p$, for the two values
of P. The curves terminate at $n_0 = 4$, according to equation (12).
$n_0$ increases rapidly as p increases, for fixed P. The small-angle
approximation for the trigonometric functions in equation (4) is shown by
the dashed curves of Fig. 1; in this approximation changing P simply
translates the curves of Fig. 1 horizontally. This is a reasonable
approximation for the smallest $n_0$ permitted [by equation (12)], for the
values of P of interest here.

Fig. 1 — Number of levels for maximum data rate versus RF signal-to-noise ratio.
Dashed lines indicate small-angle approximations.

Fig. 2 — Bandwidth expansion factor for maximum data rate.

Figures 2, 3, 4, and 5 show optimum system parameters plotted 
against two horizontal scales: 


(i) $10 \log_{10} p$: the RF signal-to-noise ratio in dB. Two plots are
shown, for P = 10°, 10°*. Using the small-angle approximation in 
equation (4), changing P translates these curves horizontally. This 
horizontal axis is the parameter of most direct physical interest. 
(ii) $n_0$: the optimum number of levels, determined from Fig. 1.
Here a single universal plot suffices rigorously for all P [that is, without
small-angle approximations in equation (4)].

Fig. 3 — Maximum data rate per unit RF bandwidth.

Fig. 4 — Relative phase shift per level in one sample interval for optimum systems.

The vertical axes show: 


Figure 2: B/N, the bandwidth expansion factor, roughly* one-half
the ratio of RF to base-bandwidth. This factor varies from about 2 at
small p or $n_0$, to an asymptotic limit of 1 as p, $n_0 \to \infty$. For large p,
we have small-index phase modulation, with only the first sideband
significant. Even for the smallest p, $n_0$ considered here the bandwidth
expansion is moderate.

Figure 3: $R_0/B$, the normalized maximum rate in bits per cycle.
This quantity increases monotonically with p, $n_0$.

Figure 4: $360 \cdot \Delta f/N$ represents the relative phase change in degrees
corresponding to a change in modulation of one level.

Figure 5: $360\, (n - 1)\, \Delta f/N$ represents the maximum relative
phase change in degrees in one sampling interval, corresponding to a
change in modulation from the lowest to the highest level. The maximum
value for this quantity, occurring for the smallest p, $n_0$ (that is, $n_0 = 4$)
is not far from 360°. As p, $n_0$ increase, the maximum phase change
becomes small for optimum systems.

Within the small-angle approximation, discussed in connection with 
Fig. 1, changing P merely shifts the horizontal (dB) axes of Fig. 1 
and Figs. 2(a) to 5(a). Let us adopt the P = 10°° curves as standard, 


and plot the number of dB to be added to the $10 \log_{10} p$ axes as a function
of P. This is shown in Fig. 6. We remark that this is only an
approximation, and will begin to fail sooner as P decreases.

* This is because the square-wave modulation assumed here is not strictly
band-limited; in fact, its spectrum falls off so slowly that its rms bandwidth is infinite.

Fig. 5 — Maximum relative phase shift in one sample interval for optimum systems.

Finally, Fig. 7 compares the maximum data rate for the multilevel 
FM system with the Shannon capacity of the RF channel. The optimum 
data rate ranges from about 19 to 27 percent of the ideal RF channel
capacity, for error probabilities P between 107° and 10°. 

We have so far dealt with optimum systems. However, the number
of levels may be fixed by other constraints, so that suboptimum systems
are of interest. For example, it may not be practical to have the large
number of levels required for optimum systems at large RF signal-to-noise
ratios p; we may be restricted to 8 (or 16) levels, and it is necessary
to determine how much the data rate will be reduced. Now rather than
maximizing R by varying N and n in equation (1) subject to the
constraints of equations (2) and (4), we fix n in equations (5) and (6).
Figures 8 and 9 show the optimum rate $R_0/B$ versus $10 \log_{10} p$ [given
also in Fig. 3(a)], together with the rates for two, four, and eight levels,
determined from equation (6) with n = 2, 4, and 8 for P = 107°, 10° in
equation (4). While eight levels is strictly optimum only at the point
of tangency between the $R_0$ and the $R_8$ curves, we see that the optimum
is fairly broad. The corresponding bandwidth expansion factors are
found from equation (5).

Fig. 6 — Correction for modifying P = 10° curves to other error probabilities.

Fig. 7 — Ratio of maximum data rate to Shannon capacity.

Fig. 8 — Best data rate for suboptimum systems with two, four, and eight levels
compared to maximum data rate for optimum system. Dashed line—maximum data
rate for optimum system, $R_0/B$ (see Fig. 3).

Fig. 9 — Best data rate for suboptimum systems with two, four, and eight levels
compared to maximum data rate for optimum system. Dashed line—maximum
data rate for optimum system, $R_0/B$ (see Fig. 3).


IV. DISCUSSION 


We have presented the results of Figs. 1 through 9 as continuous 
curves. Actually, only isolated points of these curves are significant, 
since the number of levels must be integral. These continuous curves 
should consequently be replaced by appropriate “‘staircase’’ functions, 
but the difference will be significant only for small numbers of levels 
(that is, at low RF signal-to-noise ratios). 

The present theory excludes two- and three-level systems. Naively, 
one might try to extend the present results to these cases by equation 
(17c) and Fig. 5 of Ref. 4. This may not be accurate for the error rates 
considered here (P = 107°, 107°), because the RF signal-to-noise ratio 
p becomes small, and the basic results of Ref. 4, that is, equations 
(17), (26), and (27), are asymptotic as p becomes large. However, for 
very much smaller error rates, for example, P ~ 107°°, it is possible
that this approach would be productive. 

It would be desirable to extend the present results to binary and 
ternary systems; this will require a different or improved approach 
from the asymptotic evaluation of Ref. 4 for the error probability. It 
seems likely that clicks will dominate the error behavior for optimum 
two- and three-level systems. 

The principal limitation in the present treatment (aside from the 
assumptions of the model, such as a finite-time integrator post-detection 
filter) lies in our lack of knowledge of the precise way in which the basic 
result for the probability of error P (equation (4) above) fails. We have 
merely assumed that this result holds for signal-to-noise ratios down to 
about 10 dB, independently of P or Af/N. This provides additional 
motivation for further study of the asymptotic theory of Ref. 4. 
Power Spectrum of Hard-Limited Gaussian
Processes


By HARRY M. HALL 
(Manuscript received September 10, 1968) 


The power spectral density at the output of an ideal hard limiter (one-
bit quantizer) is examined when the input is driven by a narrowband
gaussian signal plus an additive gaussian noise that consists of a broadband
background component plus narrowband interference. Assuming that the input
signal-to-noise power ratio is small by virtue of the large bandwidth of the
observed broadband noise, calculations are made of the average output signal
power, the average output noise power in the signal band, and the average
power of the strongest intermodulation product. The results support the
intuitive conclusion that spectrum analyzer performance is degraded by
the presence of the limiter and that this degradation is more pronounced
when a strong narrowband interfering signal is present. They also indicate
that the degradation can be minimized by making the bandwidth observed
by the limiter sufficiently wide that the broadband noise power dominates
both the signal and interference powers. In particular, for a typical example,
the signal-to-noise power ratio measured in the signal band is degraded by
less than about 1.3 dB by the presence of the limiter and the ratio of output
signal power to power of the strongest intermodulation product is greater
than about 14.5 dB as long as the broadband noise power exceeds the
interfering-signal power.


I. INTRODUCTION 


In this paper we examine the power spectral density at the output of 
an ideal hard limiter when the input is driven by a collection of inde- 
pendent gaussian processes. This work is motivated by the fact that in 
spectrum analysis, it is often convenient from the point of view of signal 
processing to precede the analyzer with a hard limiter. In order to deter- 
mine the effect of the limiter on analyzer performance, it is of interest 
to compare the power spectral density at the limiter output with that 
at the limiter input. With this goal in mind, the ideal limiter to be
analyzed is shown in Fig. 1.

Fig. 1 — Ideal hard limiter (input x(t), output y(t)).

It is assumed that the limiter input is driven by the signal

$x(t) = s(t) + n(t)$, (1)


where s(t) is a sample function of the gaussian "signal" process S(t) and
n(t) is a sample function of the gaussian "noise" process N(t). More
precisely, it is assumed that S(t) and N(t) are statistically independent,
zero-mean, stationary, real, gaussian processes having continuous
covariance functions $R_S(\tau)$ and $R_N(\tau)$ respectively. Further, motivated by
the spectrum analysis application, the covariance functions $R_S(\tau)$ and
$R_N(\tau)$ are specified: the signal process S(t) is assumed to be a
narrowband process with covariance function

$R_S(\tau) = R_0(\tau) \cos \omega_0 \tau$ (2)

where $S_0(f)$, the Fourier transform of $R_0(\tau)$, occupies a narrow band
centered at zero frequency. The noise process N(t) is assumed to consist
of a broadband background component plus narrowband interference
that is statistically independent of the background noise. The covariance
function of the broadband background noise is assumed to be a continuous
covariance function that is given in the form†

$R_1(\tau) = R_1(\tau_1; \tau) = \dfrac{C_0}{\tau_1}\, \rho\!\left( \dfrac{|\tau|}{\tau_1} \right) \cos \omega_1 \tau$, (3)

where $\rho(x)$ satisfies the conditions

$\rho(0) = 1$, (4)

$\int_0^{\infty} |\rho(x)|\, dx < \infty$. (5)

This specification of $R_1(\tau)$ has the properties:

(i) The total average broadband noise power $R_1(0)$ increases linearly
with $\tau_1^{-1}$, where $\tau_1 > 0$ is defined to be the broadband noise "correlation
time."

(ii) The average broadband noise power observed in any fixed band
of finite extent approaches a finite constant as the correlation time $\tau_1$
approaches zero.

† For example, consider the exponential covariance

$R_1(\tau) = \dfrac{C_0}{\tau_1} \exp\!\left( -\dfrac{|\tau|}{\tau_1} \right) \cos \omega_1 \tau$.
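Properties (i) and (ii) can be made concrete with the exponential covariance of the footnote. The sketch below is a simplified illustration under explicit assumptions: the covariance is taken as $(C_0/\tau_1) \exp(-|\tau|/\tau_1)$, the cosine factor is dropped so that the spectrum is centered at zero frequency, and the function names are hypothetical. Under these assumptions the power spectral density is $2C_0/[1 + (2\pi f \tau_1)^2]$, which integrates in closed form.

```python
import math

def total_power(C0, tau1):
    """Value of the assumed covariance at zero lag: C0 / tau1 (property i)."""
    return C0 / tau1

def band_power(C0, tau1, W):
    """Integral of 2*C0 / (1 + (2*pi*f*tau1)**2) over |f| <= W (property ii)."""
    return 2.0 * C0 * math.atan(2.0 * math.pi * W * tau1) / (math.pi * tau1)
```

As `tau1` is reduced, `total_power` grows without limit while `band_power` for a fixed `W` levels off near `4 * C0 * W`, which is the behavior the two properties describe.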

Finally, the covariance function of the narrowband interference is
assumed to be given by $R_2(\tau) \cos \omega_2 \tau$ where $S_2(f)$, the Fourier transform
of $R_2(\tau)$, occupies a narrow band centered at zero frequency. Therefore,
the covariance function of the noise process N(t) is given by

$R_N(\tau) = R_1(\tau) + R_2(\tau) \cos \omega_2 \tau$ (6)

where $R_1(\tau)$ satisfies equation (3).

It was stated that the covariance functions just specified are suggested
by the spectrum analysis application, and this is true in the following
sense: it is often the case that one desires to analyze narrowband signals
that lie at a priori unknown locations within a relatively wide band, and
in fact it may be that the total bandwidth to be searched is a significant
fraction of the band center frequency. Given such a spectrum analysis
problem, it is proposed that the situation of greatest interest is that in
which the average noise power in the narrow band actually occupied by
the signal may or may not be comparable to the average signal power,
but in which the total average noise power is much larger than the
average signal power by virtue of the large noise bandwidth. Having such a
situation in mind, it is seen that the model for the broadband covariance
function $R_1(\tau)$ specified in equation (3) does in fact exhibit the desired
behavior when the correlation time $\tau_1$ is appropriately small.

However, in addition to this "weak-signal" situation in which the
narrowband signal power $R_S(0)$ is much smaller than the broadband
noise power $R_1(0)$, it is also of interest to allow the presence of "strong"
narrowband signals whose average power is comparable to that of the
broadband background noise. The presence of such strong narrowband
signals is expected to be obvious at the limiter output, and in fact these
signals are of interest since we expect that their presence will lead to the
generation of intermodulation products that may interfere with the
analysis of any weak signals that are present. In order to examine this
situation, a narrowband interference has been included, and it is
convenient to consider this interfering signal to be part of the additive
noise N(t).

Before proceeding with the analysis of the problem stated above, it 
is noted that the ideal limiter described in Fig. 1 has received a great 
deal of attention in the literature. The noiseless case has been considered 
and output amplitudes examined when the input consists of a collection
of sinusoids.’’? The noise-alone case has been examined and results ob- 
tained for the autocorrelation function and power spectral density at the 
limiter output both for the case of broadband gaussian noise alone 
[2,(r)] and for the case of narrowband gaussian noise alone [R,(r) 
COS we7].°'* The ratio of output signal-to-noise ratio (SNR) to input SNR 
has been evaluated for the case in which the input consists of one or two 
sinusoids plus narrowband gaussian noise.°~’ These same workers have 
examined the strengths of intermodulation products, and the analysis 
of output signal and intermodulation product power has been extended 
to the case of an arbitrary number of sinusoids plus gaussian noise.*’® 
In addition, analysis of the limiter has played an important part in 
studies of the performance of angle-modulation systems, and these 
analyses have generally assumed that the limiter is driven by a narrow- 
band process. 

On the other hand, it does not appear that much has been reported 
for the situation in which the limiter is driven by a narrowband signal 
plus noise that includes a broadband component. Known results that 
have application to this situation include those of Manasse, and others, 
which apply when the limiter is driven by a "weak" narrowband signal 
plus narrowband gaussian noise whose bandwidth is much larger than 
that of the signal (Ref. 10), plus approximate results that apply when the 
input includes a narrowband component that is "much stronger" than the 
sum of the other inputs present (Ref. 11). We address this problem by 
examining the output power spectral density when the limiter input is 
given by equation (1); namely, the input is made up of a narrowband 
gaussian signal plus a gaussian noise consisting of a broadband 
background component plus narrowband interference. In particular, the 
output power spectral density is calculated in Section II as the 
broadband noise correlation time $\tau_1$ approaches zero. 
This calculated result is then used in Section III to evaluate three 
performance measures. An example of a system to which these 
performance measures apply is a spectrum analyzer preceded by the ideal 
limiter. 


(i) The degradation in the ratio (SNR) of average signal power to 
average noise power in the spectral band occupied by the signal is 
calculated. This degradation is important because the signal-to-noise 
power ratio measured in the signal band is often one of the important 
parameters in determining system performance. 

(ii) The ratio (SIR) of average output signal power to average image 
power is calculated where, if the narrowband signal is centered at a 
frequency $f_0$ and the narrowband interference is centered at a frequency 
$f_2$, then the signal image is defined to be that intermodulation product 
centered at the frequency $|2f_2 - f_0|$. This is the strongest of the 
intermodulation products of the signal with the additive noise, and thus 
it is reasonable to use the SIR as an indication of whether or not these 
intermodulation products will have a significant effect on system 
performance. 

(iii) The ratio $S_2NR_o$ of average output interference power to 
average output broadband noise power in the spectral band occupied by the 
interference is calculated. As discussed previously, the distinction in 
this work between signal and interference is made based upon average 
power at the limiter input. That is, it has been assumed that the presence 
of any narrowband signal having an average power comparable to that 
of the broadband background noise will be obvious at the limiter 
output, and that such an input may in fact interfere with the analysis of 
other narrowband inputs. $S_2NR_o$ is calculated to check the assumption 
that in fact the presence and location of such an interfering signal will 
be obvious upon analyzing the power spectrum at the limiter output. 


Since the performance measures listed above are calculated as the 
broadband noise correlation time $\tau_1$ approaches zero, it follows that they 
will all apply in practice to situations in which the broadband component 
of the input noise has been shaped by a low-pass filter whose bandwidth 
is large compared with the center frequencies of the narrowband inputs 
that may be present. An example of a situation in which such a model is 
viable occurs in the spectrum analysis of underwater acoustical signals. 

On the other hand, the SNR and $S_2NR_o$ results obtained will not apply 
directly to communication situations in which the bandwidth of the 
additive broadband noise is much larger than that of the narrowband 
signal but much smaller than the system center frequency. This 
situation is discussed in Section IV, and it is pointed out there that the 
results can be modified to encompass this situation by letting the center 
frequencies of both the narrowband signal and additive noise increase 
linearly with $\tau_1^{-1}$. 


II. THE OUTPUT POWER SPECTRAL DENSITY 


The output power spectral density can be calculated by using the 
expression for the output autocorrelation function $R_Y(\tau)$ given by 
Davenport and Root (Ref. 12, p. 308) 
Ay rte + m)/2] k(.) pm : 
Ry(1) = ae» TE ml Ra) RO] PsA), k + m odd 


dB otherwise, (7) 


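where $\Gamma(x)$ denotes the gamma function, in conjunction with the 
expression for $R_Y(\tau)$ given by Van Vleck (Ref. 3, p. 23) 

$$R_Y(\tau) = \frac{2}{\pi} \arcsin\left[\frac{R_S(\tau) + R_N(\tau)}{R_S(0) + R_N(0)}\right]. \tag{8}$$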
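
The arcsine relation attributed above to Van Vleck can be illustrated numerically. The following Python sketch is only an informal check, not part of the original analysis: it assumes a limiter with output levels of plus and minus one and unit-variance, zero-mean, jointly gaussian input samples with correlation coefficient `rho`, and the function names are illustrative.

```python
import math
import random


def limiter_correlation(rho, n_pairs=200_000, seed=1):
    """Monte Carlo estimate of E[sgn(x) sgn(y)] for jointly gaussian x, y.

    x and y are zero-mean, unit-variance, with correlation coefficient rho.
    """
    rng = random.Random(seed)
    s = math.sqrt(1.0 - rho * rho)
    total = 0
    for _ in range(n_pairs):
        x = rng.gauss(0.0, 1.0)
        y = rho * x + s * rng.gauss(0.0, 1.0)  # correlation with x equals rho
        total += 1 if (x >= 0.0) == (y >= 0.0) else -1
    return total / n_pairs


def arcsine_law(rho):
    """Limiter output correlation predicted by the arcsine relation."""
    return (2.0 / math.pi) * math.asin(rho)


if __name__ == "__main__":
    for rho in (0.0, 0.5, 0.9):
        print(rho, limiter_correlation(rho), arcsine_law(rho))
```

The simulated and predicted values should agree to within the Monte Carlo sampling error, which is of order $n^{-1/2}$ for $n$ sample pairs.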


Defining $\alpha$ to be the fraction of the average noise power due to the 
broadband background noise, 

$$\alpha \triangleq \frac{R_1(0)}{R_N(0)} = \frac{R_1(0)}{R_1(0) + R_2(0)}, \tag{9}$$

it is seen that the ratio $\eta_S$ of average signal power to average noise power 
at the limiter input is given by 

$$\eta_S \triangleq \frac{R_S(0)}{R_N(0)} = \alpha \frac{R_S(0)}{C_0} \tau_1. \tag{10}$$

Now, it was pointed out in Section I that we are interested in the 
situation in which the signal-to-noise power ratio $\eta_S$ is small, and in fact 
the case of interest is that in which $\eta_S$ is small because $\tau_1$ is small, that is, 
$\eta_S$ is small due to the large bandwidth of the observed broadband 
background noise. Motivated by this, it is shown in Appendix A, using 
the expressions for $R_Y(\tau)$ given by equations (7) and (8), that when 
$\alpha > 0$ the output power spectral density $S_Y(f)$ is given by 

$$S_Y(f) = \frac{4}{\pi} \left\{ \int_0^\infty \arcsin \rho_N(\tau) \cos \omega\tau \, d\tau + \alpha \frac{R_S(0)}{C_0} \tau_1 \int_0^\infty \left[\rho_S(\tau) - (1 - \alpha)\rho_2(\tau) \cos \omega_2\tau\right] \left[1 - (1 - \alpha)^2 \rho_2^2(\tau) \cos^2 \omega_2\tau\right]^{-1/2} \cos \omega\tau \, d\tau \right\} + o(\tau_1) \tag{11}$$


as $\tau_1 \to 0$, uniformly in $f$, where 

$$\rho_\nu(\tau) \triangleq \frac{R_\nu(\tau)}{R_\nu(0)}, \qquad \nu = S, N, 0, 1, 2, \tag{12}$$

are assumed to be absolutely integrable. 

Equation (11) exhibits the components that dominate the output 
power spectral density when the broadband noise correlation time $\tau_1$ 
approaches zero. In particular, inspection of equation (11) shows that 
these dominant contributions include a component that is just the 
output power spectral density observed when the noise $N(t)$ alone is present 
at the limiter input, a component that has the spectral characteristics 
of the signal $S(t)$, and a component that is due to interaction of the 
signal with the interference component [$\rho_2(\tau) \cos \omega_2\tau$] of the noise. In 
order to quantitatively analyze these components where, in particular, 
we desire to use $S_Y(f)$ to calculate the performance measures discussed 
in Section I, it is convenient to make use of the fact that both the signal 
$S(t)$ and the interference component of the noise have been assumed to 
be narrowband processes, plus the fact that the broadband component 
of the noise becomes white across any fixed band of finite extent when 
$\tau_1 \to 0$. These properties can be exploited by expanding both 
$[1 - (1 - \alpha)^2 \rho_2^2(\tau) \cos^2 \omega_2\tau]^{-1/2}$ and $\arcsin \rho_N(\tau)$, followed by an appropriate 
collection of terms. This is carried out in Appendix B and the result is 


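$$S_Y(f) = S_{Y1}(f) + S_{Y2}(f) + \frac{4}{\pi^2} \alpha \frac{R_S(0)}{C_0} \tau_1 \sum_{m=0}^{\infty} \epsilon_m \frac{\Gamma^2(m + \tfrac{1}{2})}{\Gamma(2m + 1)} (1 - \alpha)^{2m} \int_0^\infty {}_2F_1\left[m + \tfrac{1}{2}, m + \tfrac{1}{2}; 2m + 1; (1 - \alpha)^2 \rho_2^2(\tau)\right] \rho_S(\tau) \rho_2^{2m}(\tau) \cos 2m\omega_2\tau \cos \omega\tau \, d\tau$$

$$\qquad - \frac{8}{\pi^2} \alpha \frac{R_S(0)}{C_0} \tau_1 \sum_{m=0}^{\infty} \frac{\Gamma(m + \tfrac{1}{2}) \Gamma(m + \tfrac{3}{2})}{\Gamma(2m + 2)} (1 - \alpha)^{2m+1} \int_0^\infty {}_2F_1\left[m + \tfrac{1}{2}, m + \tfrac{3}{2}; 2m + 2; (1 - \alpha)^2 \rho_2^2(\tau)\right] \rho_2^{2m+1}(\tau) \cos (2m + 1)\omega_2\tau \cos \omega\tau \, d\tau + o(\tau_1) \tag{13}$$

as $\tau_1 \to 0$, uniformly in $f$, where ${}_2F_1(a, b; c; x)$ is Gauss's hypergeometric 
function (Ref. 13, p. 556), $\epsilon_m$ is the Neumann factor $\epsilon_0 = 1$, $\epsilon_m = 2$ 
($m = 1, 2, \ldots$), and where $S_{Y1}(f)$ and $S_{Y2}(f)$ are given by 

$$S_{Y1}(f) = \frac{4}{\pi} \tau_1 \int_0^\infty \left\{ \arcsin\left[\alpha\rho(x) + 1 - \alpha\right] - \arcsin (1 - \alpha) \right\} dx + o(\tau_1) \tag{14}$$

as $\tau_1 \to 0$, for all $f \le f_{\max} < \infty$ for arbitrary fixed $f_{\max}$,† and 

$$S_{Y2}(f) = \frac{4}{\pi^2} \sum_{m=0}^{\infty} \frac{\Gamma^2(m + \tfrac{1}{2})}{\Gamma(2m + 2)} (1 - \alpha)^{2m+1} \int_0^\infty {}_2F_1\left[m + \tfrac{1}{2}, m + \tfrac{1}{2}; 2m + 2; (1 - \alpha)^2 \rho_2^2(\tau)\right] \rho_2^{2m+1}(\tau) \cos (2m + 1)\omega_2\tau \cos \omega\tau \, d\tau. \tag{15}$$

† Recall from equation (3) that $\rho_1(\tau) = \rho(\tau/\tau_1) \cos \omega_1\tau$, where $\rho(x)$ satisfies equations (4) and (5). 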

The expression given by equations (13), (14), and (15) exhibits in a 
useful fashion the components that dominate the output power spectral 
density when the broadband noise correlation time approaches zero. To 
see this more clearly, it is convenient to assume that the narrowband 
interference in fact has a line spectrum, that is, 


$$\rho_2(\tau) = 1. \tag{16}$$


This assumption is convenient since it simplifies the calculations with- 
out obscuring the most important effects that result from the presence 
of narrowband interference. This assumption is applied in Appendix B 
to equations (13), (14), and (15), and it is shown that, when $\rho_2(\tau) = 1$, 
we can write 


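$$\lim_{\tau_1 \to 0} S_Y(f) = S_{Y1}(f) + S_{Y2}(f) + \frac{\alpha \tau_1}{\pi^2 C_0} \sum_{m=0}^{\infty} \epsilon_m \frac{\Gamma^2(m + \tfrac{1}{2})}{\Gamma(2m + 1)} (1 - \alpha)^{2m} \, {}_2F_1\left[m + \tfrac{1}{2}, m + \tfrac{1}{2}; 2m + 1; (1 - \alpha)^2\right] \left[S_S(f - 2mf_2) + S_S(f + 2mf_2)\right] \tag{17}$$

where $S_{Y1}(f)$ is given by equation (14), 

$$S_{Y2}(f) = \frac{1}{\pi^2} \sum_{m=0}^{\infty} \frac{\Gamma^2(m + \tfrac{1}{2})}{\Gamma(2m + 2)} (1 - \alpha)^{2m+1} \, {}_2F_1\left[m + \tfrac{1}{2}, m + \tfrac{1}{2}; 2m + 2; (1 - \alpha)^2\right] \left\{\delta[f - (2m + 1)f_2] + \delta[f + (2m + 1)f_2]\right\} \tag{18}$$

where $\delta(x)$ denotes the Dirac delta function, and where 

$$S_S(f) = 2 \int_0^\infty R_S(\tau) \cos \omega\tau \, d\tau \tag{19}$$

is the power spectral density of the signal $S(t)$. Equations (17), (14), 
and (18) give the representation we desire, and they demonstrate that 
there are three contributions that dominate the output power spectral 
density when the broadband noise correlation time $\tau_1$ approaches zero. 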

(i) There is a component $S_{Y1}(f)$ that becomes white across any 
frequency band of finite extent as $\tau_1 \to 0$. When $\alpha = 1$, this component is 
just the output power spectral density that would be observed if the 
broadband component of the noise were present alone at the limiter 
input. Moreover, it is necessary to specify the broadband covariance 
function $R_1(\tau)$ in order to calculate $S_{Y1}(f)$. For example, if $R_1(\tau)$ is the 
"triangular" covariance function 

$$R_{1\Delta}(\tau) \triangleq \begin{cases} \dfrac{C_0}{\tau_1}\left(1 - \dfrac{|\tau|}{\tau_1}\right), & |\tau| \le \tau_1 \\ 0, & |\tau| > \tau_1, \end{cases} \tag{20}$$

then equation (14) gives the result 

$$S_{Y1\Delta}(f) = \frac{4\tau_1}{\pi\alpha} \left[\frac{\pi}{2} - \arcsin (1 - \alpha) - (2\alpha - \alpha^2)^{1/2}\right] + o(\tau_1) \tag{21}$$

as $\tau_1 \to 0$, for all $f \le f_{\max} < \infty$ for arbitrary fixed $f_{\max}$. 

(ii) There is a component $S_{Y2}(f)$ consisting of line spectra located at 
$|f| = kf_2$, $k = 1, 3, \ldots$. When $\alpha = 0$, this component is just the power 
spectral density that would be observed if the narrowband interference 
were present alone at the limiter input. 

(iii) There is a component consisting of a term that has the spectral 
characteristics of the signal plus terms that are intermodulation products 
of the signal with the narrowband interference component of the noise. 
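
The closed form quoted in equation (21) for the triangular covariance function can be checked against a direct numerical evaluation of equation (14). The Python sketch below is an informal check under stated assumptions: it takes $\rho(x) = 1 - x$ on $0 \le x \le 1$ (zero elsewhere), drops the $o(\tau_1)$ remainder, and uses a simple midpoint rule; the function names are illustrative.

```python
import math


def s_y1_closed_form(alpha, tau1=1.0):
    """Leading term of equation (21) for the triangular covariance function."""
    bracket = (math.pi / 2.0 - math.asin(1.0 - alpha)
               - math.sqrt(2.0 * alpha - alpha * alpha))
    return 4.0 * tau1 / (math.pi * alpha) * bracket


def s_y1_quadrature(alpha, tau1=1.0, n=20_000):
    """Midpoint-rule evaluation of the integral in equation (14).

    Assumes rho(x) = 1 - x for 0 <= x <= 1; the integrand vanishes for x > 1.
    """
    h = 1.0 / n
    total = 0.0
    for i in range(n):
        x = (i + 0.5) * h
        total += math.asin(alpha * (1.0 - x) + 1.0 - alpha) - math.asin(1.0 - alpha)
    return 4.0 * tau1 / math.pi * total * h


if __name__ == "__main__":
    for alpha in (0.25, 0.5, 1.0):
        print(alpha, s_y1_closed_form(alpha), s_y1_quadrature(alpha))
```

For $\alpha = 1$ both evaluations reduce to $2\tau_1(1 - 2/\pi)$, the level of the limiter output spectrum near zero frequency when only the broadband noise is present.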


2.1 Noise Consisting of Broadband Component Alone 


It is clear from inspection of equations (17), (14), and (18) that 
the output power spectral density is greatly simplified when the additive 
noise consists only of the broadband component ($\alpha = 1$), and in fact 
it is seen that in this case equation (13) reduces to the simple result 

$$S_Y(f) = S_{Y1}(f) + \frac{2}{\pi} \frac{\tau_1}{C_0} S_S(f) + o(\tau_1) \tag{22}$$

as $\tau_1 \to 0$, uniformly in $f$. Moreover, the calculation of $S_{Y1}(f)$ is 
simplified when $\alpha = 1$. For example, if $R_1(\tau)$ is given by the triangular function 
in equation (20), then it is seen that, when $\alpha = 1$, 

$$S_{Y1\Delta}(f) = \frac{4}{\pi} \int_0^{\tau_1} \arcsin\left(1 - \frac{\tau}{\tau_1}\right) \cos \omega\tau \, d\tau. \tag{23}$$


This integral can be evaluated using Erdelyi [Ref. 14, item 4.8(1)], 
and we find 

$$S_{Y1\Delta}(f) = 2\tau_1 \left[ J_0(\omega\tau_1) \operatorname{sinc} 2f\tau_1 - H_0(\omega\tau_1) \frac{\cos \omega\tau_1}{\omega\tau_1} \right] \tag{24}$$

where $J_\nu(x)$ denotes the Bessel function of the first kind of order $\nu$ and 
$H_\nu(x)$ is a Struve function of order $\nu$ (Ref. 14, p. 3872).† Note that 
equation (24) holds for all $f$ and for all $\tau_1$. $S_{Y1\Delta}(f)$ is plotted in Fig. 2 along 
with 

$$S_{1\Delta}(f) = C_0 \operatorname{sinc}^2 (f\tau_1), \tag{25}$$

the power spectral density at the limiter input corresponding to $R_{1\Delta}(\tau)$. 
The plotted data are normalized so that both processes have the same 
average power. Thus the data plotted in Fig. 2 show explicitly how the 
ideal limiter redistributes the average broadband noise power across the 
band and demonstrate in particular the power-spreading effect that 
takes place due to the limiter nonlinearity. 
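
The power-spreading effect described above can be reproduced numerically from equation (23). The following Python sketch is a rough illustration under stated assumptions: the limiter output levels are taken to be plus and minus one (so the output has unit average power), the input spectrum of equation (25) is divided by the input power $C_0/\tau_1$ so that both processes have the same average power, and the integral is evaluated by a midpoint rule rather than by the Bessel and Struve function form. The function names are illustrative.

```python
import math


def limiter_output_psd(f, tau1=1.0, n=20_000):
    """Midpoint-rule evaluation of equation (23) at frequency f (alpha = 1)."""
    h = tau1 / n
    total = 0.0
    for i in range(n):
        t = (i + 0.5) * h
        total += math.asin(1.0 - t / tau1) * math.cos(2.0 * math.pi * f * t)
    return 4.0 / math.pi * total * h


def limiter_input_psd(f, tau1=1.0):
    """Equation (25) normalized to unit average power: tau1 * sinc^2(f * tau1)."""
    x = math.pi * f * tau1
    return tau1 if x == 0.0 else tau1 * (math.sin(x) / x) ** 2


if __name__ == "__main__":
    for f in (0.0, 0.5, 1.0, 1.5):
        print(f, limiter_input_psd(f), limiter_output_psd(f))
```

Under these assumptions the output density lies below the input density near zero frequency and above it near the first null of the input spectrum at $f = 1/\tau_1$.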


III. EVALUATION OF PERFORMANCE MEASURES 


It is now desired to use the output power spectral density results 
derived above to evaluate the performance measures discussed in Sec- 
tion I. These calculations use directly the results derived above except 
that the assumption that the narrowband interference has a line spec- 
trum can be relaxed. That is, the results derived below continue to be 
useful as long as the interference is a narrowband gaussian process with 
the covariance function $R_2(\tau) \cos \omega_2\tau$ specified in Section I. 


3.1 Degradation in Signal-to-Noise Power Ratio 


The degradation in signal-to-noise power ratio in the spectral band 
occupied by the signal is obtained by calculating the ratio $SNR_o/SNR_i$ 
of output signal-to-noise power ratio to input signal-to-noise power 
ratio, where these SNR's are calculated in the spectral band $B$ occupied 
by the signal. Moreover, we assume that: 


(i) The band $B$ contains significant contributions from only the 
narrowband signal and the broadband component of the noise, that is, 
the narrowband interference and intermodulation products of the 
narrowband signal with the narrowband interference have negligible 
power in the band B. 

(ii) $R_1(\tau)$ is the triangular function in equation (20), since it is 
necessary to specify the covariance function of the broadband component of 
the noise. 


† Note that $\operatorname{sinc} x \triangleq \dfrac{\sin \pi x}{\pi x}$. 

Fig. 2 — Normalized power spectral density: $S_{Y1\Delta}(f)$ and $S_{1\Delta}(f) = C_0 \operatorname{sinc}^2 (f\tau_1)$. 

Making these assumptions, the ratio $SNR_o/SNR_i$ measured in the 
band $B$ of finite extent can be calculated using $S_Y(f)$ given by equation 
(17), and it is seen that 

$$\frac{SNR_o}{SNR_i} = \frac{\displaystyle \int_B S_{YS}(f) \, df \Big/ \int_B S_{Y1}(f) \, df}{\displaystyle \int_B S_S(f) \, df \Big/ \int_B S_1(f) \, df} \tag{26}$$

where $S_{Y1}(f)$ is given by equation (21), $S_S(f)$ is the power spectral 
density of the narrowband signal $S(t)$, and $S_{YS}(f)$ and $S_1(f)$ are given as follows. 
$S_{YS}(f)$ is defined to be the contribution to $S_Y(f)$ that has the spectral 
characteristics of the signal $S(t)$ and thus is determined by setting 
$m = 0$ in the sum in equation (17). This gives 

$$S_{YS}(f) = \frac{2}{\pi} \frac{\alpha \tau_1}{C_0} \, {}_2F_1\left[\tfrac{1}{2}, \tfrac{1}{2}; 1; (1 - \alpha)^2\right] S_S(f) \tag{27}$$

which (using p. 387 of Ref. 14) can be written as 

$$S_{YS}(f) = \frac{4}{\pi^2} \frac{\alpha \tau_1}{C_0} K(1 - \alpha) S_S(f) \tag{28}$$


where $K(k)$ denotes the complete elliptic integral of the first kind. $S_1(f)$ 
is defined to be the power spectral density at the limiter input due to 
the broadband component of the noise and thus, using equation (25), 
is given by 

$$S_1(f) = C_0 + O(\tau_1^2) \tag{29}$$

as $\tau_1 \to 0$, for all $f \le f_{\max} < \infty$ for arbitrary fixed $f_{\max}$. Thus, making 
the appropriate substitutions into equation (26) yields 

$$\lim_{\tau_1 \to 0} \frac{SNR_o}{SNR_i} = \frac{\alpha^2 K(1 - \alpha)}{\pi \left[\dfrac{\pi}{2} - \arcsin (1 - \alpha) - (2\alpha - \alpha^2)^{1/2}\right]}. \tag{30}$$

This relative signal-to-noise power ratio result is plotted in Fig. 3 and 
demonstrates the expected result that the degradation in the signal 
band increases when there is a strong narrowband interfering signal 
present at the limiter input. However, it is important to note that the 
narrowband interference must be very strong to cause a significant 
increase in the degradation. In particular, it is seen that the degradation 
is less than about 1.3 dB as long as $\alpha$ is greater than 0.5, that is, as long 
as the broadband noise power is greater than the narrowband 
interference power. 

Fig. 3 — Relative signal-to-noise power ratios. 
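
The figure of about 1.3 dB quoted above can be reproduced from equation (30). The Python sketch below is an informal evaluation under stated assumptions: $K(k)$ is taken with modulus $k = 1 - \alpha$ and is computed with the arithmetic-geometric mean, which is a standard method but not one used in the original text; the function names are illustrative.

```python
import math


def ellip_k(k):
    """Complete elliptic integral of the first kind K(k), modulus k (AGM method)."""
    a, b = 1.0, math.sqrt(1.0 - k * k)
    for _ in range(40):  # quadratic convergence; 40 steps is far more than needed
        a, b = 0.5 * (a + b), math.sqrt(a * b)
    return math.pi / (2.0 * a)


def snr_ratio(alpha):
    """Right-hand side of equation (30): limiting ratio SNR_o / SNR_i."""
    bracket = (math.pi / 2.0 - math.asin(1.0 - alpha)
               - math.sqrt(2.0 * alpha - alpha * alpha))
    return alpha ** 2 * ellip_k(1.0 - alpha) / (math.pi * bracket)


def snr_ratio_db(alpha):
    """Equation (30) expressed in decibels (negative values are a degradation)."""
    return 10.0 * math.log10(snr_ratio(alpha))


if __name__ == "__main__":
    for alpha in (0.25, 0.5, 0.75, 1.0):
        print(alpha, snr_ratio_db(alpha))
```

With only broadband noise present ($\alpha = 1$) the expression reduces to $1/(\pi - 2)$, a degradation of roughly 0.6 dB, and at $\alpha = 0.5$ it gives a degradation of roughly 1.3 dB.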


3.2 Signal-to-Image Power Ratio 

The signal-to-image power ratio (SIR) is obtained by calculating the 
ratio of average output signal power to average image power where the 
image has been defined to be that narrowband component of $S_Y(f)$ 
centered at the frequency $|2f_2 - f_0|$. The SIR can be calculated using 
$S_Y(f)$ given by equation (17), but it should be noted that, when $\tau_1 \to 0$, 
the SIR does not depend on the particular choice of $R_1(\tau)$ within the 
class specified by equation (3). Using equation (17), it is seen that 

$$\lim_{\tau_1 \to 0} SIR = \frac{\displaystyle \int_{B} S_{YS}(f) \, df}{\displaystyle \int_{B_I} S_{YI}(f) \, df} \tag{31}$$

where $S_{YS}(f)$ is given by equation (28) and $S_{YI}(f)$ is found by setting 
$m = 1$ in the sum in equation (17). That is, 

$$S_{YI}(f) = \frac{1}{4\pi} \frac{\alpha (1 - \alpha)^2 \tau_1}{C_0} \, {}_2F_1\left[\tfrac{3}{2}, \tfrac{3}{2}; 3; (1 - \alpha)^2\right] \left[S_S(f - 2f_2) + S_S(f + 2f_2)\right], \tag{32}$$

which, using Abramowitz and Stegun [Ref. 13, item 15.2.1] together 
with Price [Ref. 15, p. 10] and Dwight [Ref. 16, items 788.1, 788.2], 
can be written as 

$$S_{YI}(f) = \frac{8}{\pi^2} \frac{\alpha}{(1 - \alpha)^2} \left[\frac{1 + 2\alpha - \alpha^2}{2} K(1 - \alpha) - E(1 - \alpha)\right] \frac{\tau_1}{C_0} \left[S_S(f - 2f_2) + S_S(f + 2f_2)\right] \tag{33}$$

where $E(k)$ denotes the complete elliptic integral of the second kind. 
Making the appropriate substitutions, there results 

$$\lim_{\tau_1 \to 0} SIR = \frac{(1 - \alpha)^2 K(1 - \alpha)}{(1 + 2\alpha - \alpha^2) K(1 - \alpha) - 2E(1 - \alpha)}. \tag{34}$$
This SIR result is plotted in Fig. 4 and demonstrates that the 
signal-to-image power ratio decreases when there is a strong narrowband 
interfering signal present at the limiter input. In fact, equation (34) 
has the limiting behavior 

$$\lim_{\alpha \to 0} \lim_{\tau_1 \to 0} SIR = 1, \tag{35}$$

which agrees with the approximate result obtained when one assumes 
that the input to the limiter includes a narrowband component that is 
much stronger than the sum of the other input components present (Ref. 11). 
However, the most interesting result demonstrated by Fig. 4 is that 
the narrowband interference must be very strong for the image power 
to be comparable to the signal power at the limiter output. In particular, 
it is seen that the SIR is greater than about 14.5 dB as long as the 
broadband noise power is greater than the narrowband interference power. 

Fig. 4 — Signal-to-image power ratio; $\alpha \triangleq R_1(0)/[R_1(0) + R_2(0)]$. 
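
The 14.5 dB figure quoted above can be reproduced from equation (34). The following Python sketch is an informal evaluation under stated assumptions: $K(k)$ and $E(k)$ are taken with modulus $k = 1 - \alpha$ and are computed with the arithmetic-geometric mean iteration, a standard method that is not part of the original text; the function names are illustrative.

```python
import math


def ellip_ke(k):
    """Complete elliptic integrals K(k) and E(k), modulus k (AGM method)."""
    a, b = 1.0, math.sqrt(1.0 - k * k)
    s = 0.5 * k * k  # running sum of 2^(n-1) * c_n^2, starting with c_0 = k
    weight = 1.0     # 2^(n-1) for n = 1
    for _ in range(40):
        a_next = 0.5 * (a + b)
        c = 0.5 * (a - b)
        b = math.sqrt(a * b)
        a = a_next
        s += weight * c * c
        weight *= 2.0
    big_k = math.pi / (2.0 * a)
    return big_k, big_k * (1.0 - s)


def signal_to_image_ratio(alpha):
    """Right-hand side of equation (34): limiting signal-to-image power ratio."""
    k = 1.0 - alpha
    big_k, big_e = ellip_ke(k)
    return k * k * big_k / ((1.0 + 2.0 * alpha - alpha * alpha) * big_k - 2.0 * big_e)


def signal_to_image_ratio_db(alpha):
    """Equation (34) expressed in decibels."""
    return 10.0 * math.log10(signal_to_image_ratio(alpha))


if __name__ == "__main__":
    for alpha in (0.1, 0.25, 0.5, 0.75, 0.9):
        print(alpha, signal_to_image_ratio_db(alpha))
```

At $\alpha = 0.5$ this evaluates to roughly 14.4 dB, and the ratio grows rapidly as $\alpha$ approaches unity, that is, as the interference becomes weak relative to the broadband noise.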


3.3 Output Interference-to-Broadband Noise Power Ratio 


The output interference-to-broadband noise power ratio $S_2NR_o$ is 
obtained by calculating the ratio of average output interference power 
to average output broadband noise power, measured in the spectral 
band $B_2$† occupied by the interference. In order to perform this 
calculation it is necessary to specify the broadband covariance function, and 
it is assumed that $R_1(\tau)$ is the triangular function in equation (20). 
Having specified $R_1(\tau)$ in this manner, $S_2NR_o$ can be calculated using 
$S_Y(f)$ given by equation (17), and it is seen that 

$$\lim_{\tau_1 \to 0} S_2NR_o = \frac{\displaystyle \int_{B_2} S_{Y2}(f) \, df}{\displaystyle \int_{B_2} S_{Y1}(f) \, df} \tag{36}$$

where $S_{Y1}(f)$ is given by equation (21) and $S_{Y2}(f)$ is given by equation 
(18). Proceeding with these substitutions and making the assumption 
that the components of $S_{Y2}(f)$ concentrated at (odd) harmonics of the 
fundamental frequency $f_2$ contribute negligible power in the band $B_2$, 
there results 

$$\lim_{\tau_1 \to 0} S_2NR_o = \frac{\alpha (1 - \alpha) \, {}_2F_1\left[\tfrac{1}{2}, \tfrac{1}{2}; 2; (1 - \alpha)^2\right]}{2\tau_1 \left[\dfrac{\pi}{2} - \arcsin (1 - \alpha) - (2\alpha - \alpha^2)^{1/2}\right] \displaystyle \int_{B_2} df}, \tag{37}$$

which, making use of Price [Ref. 15, p. 10], can be written as 

$$\lim_{\tau_1 \to 0} S_2NR_o = \frac{2\alpha \left[E(1 - \alpha) - (2\alpha - \alpha^2) K(1 - \alpha)\right]}{\pi (1 - \alpha) \left[\dfrac{\pi}{2} - \arcsin (1 - \alpha) - (2\alpha - \alpha^2)^{1/2}\right] W \tau_1} \tag{38}$$

where 

$$W \triangleq \int_{B_2} df. \tag{39}$$

The normalized power ratio $\lim_{\tau_1 \to 0} W\tau_1 (S_2NR_o)$ is plotted in Fig. 5, and 
the plotted data are seen to support the intuitive assumption made in 
Section I that the presence and location of a narrowband input having 
an average power comparable to that of the broadband background 
noise will be obvious at the limiter output. 
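
The step from the hypergeometric form in equation (37) to the elliptic-integral form in equation (38) can be checked numerically. The Python sketch below is an informal check under stated assumptions: both expressions are multiplied by $W\tau_1$ (the normalization used for Fig. 5), the hypergeometric function is summed from its power series, and $K$ and $E$ are taken with modulus $1 - \alpha$ and computed by the arithmetic-geometric mean. None of these numerical methods comes from the original text, and the function names are illustrative.

```python
import math


def hyp2f1(a, b, c, z, tol=1e-16, max_terms=10_000):
    """Gauss hypergeometric series for |z| < 1 (direct summation)."""
    term, total = 1.0, 1.0
    for n in range(max_terms):
        term *= (a + n) * (b + n) / ((c + n) * (n + 1)) * z
        total += term
        if abs(term) < tol * abs(total):
            break
    return total


def ellip_ke(k):
    """Complete elliptic integrals K(k) and E(k), modulus k (AGM method)."""
    a, b = 1.0, math.sqrt(1.0 - k * k)
    s, weight = 0.5 * k * k, 1.0
    for _ in range(40):
        a_next, c = 0.5 * (a + b), 0.5 * (a - b)
        b = math.sqrt(a * b)
        a = a_next
        s += weight * c * c
        weight *= 2.0
    big_k = math.pi / (2.0 * a)
    return big_k, big_k * (1.0 - s)


def _bracket(alpha):
    return math.pi / 2.0 - math.asin(1.0 - alpha) - math.sqrt(2.0 * alpha - alpha ** 2)


def normalized_s2nr_hypergeometric(alpha):
    """W * tau1 times the right-hand side of equation (37)."""
    f = hyp2f1(0.5, 0.5, 2.0, (1.0 - alpha) ** 2)
    return alpha * (1.0 - alpha) * f / (2.0 * _bracket(alpha))


def normalized_s2nr_elliptic(alpha):
    """W * tau1 times the right-hand side of equation (38)."""
    big_k, big_e = ellip_ke(1.0 - alpha)
    num = 2.0 * alpha * (big_e - (2.0 * alpha - alpha ** 2) * big_k)
    return num / (math.pi * (1.0 - alpha) * _bracket(alpha))
```

The two functions should agree to within rounding error for $0 < \alpha < 1$, which is consistent with the identity ${}_2F_1(\tfrac{1}{2}, \tfrac{1}{2}; 2; k^2) = 4[E(k) - (1 - k^2)K(k)]/(\pi k^2)$.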

A result of perhaps more interest than $S_2NR_o$ is the ratio 
$S_2NR_o/S_2NR_i$ of output interference-to-broadband noise power ratio to input 
interference-to-broadband noise power ratio. This calculation can be 
carried out in the same way that $SNR_o/SNR_i$ was calculated earlier, 
and we find 


† This calculation is not of interest if the interference truly has a line spectrum 
(that we can resolve). However, it is of interest here since these results are useful as 
long as the interference is a narrowband gaussian process. 
Fig. 5 — Normalized output interference-to-broadband noise power ratio. 
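$$\frac{S_2NR_o}{S_2NR_i} = \frac{\displaystyle \int_{B_2} S_{Y2}(f) \, df \Big/ \int_{B_2} S_{Y1}(f) \, df}{\displaystyle \int_{B_2} S_2(f) \, df \Big/ \int_{B_2} S_1(f) \, df} \tag{40}$$

where $S_{Y1}(f)$ is given by equation (21), $S_{Y2}(f)$ by equation (18), $S_1(f)$ 
by equation (29), and 

$$\int_{B_2} S_2(f) \, df = R_2(0). \tag{41}$$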


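Making these substitutions and using the definition of $\alpha$ in equation (9) 
yields 

$$\lim_{\tau_1 \to 0} \frac{S_2NR_o}{S_2NR_i} = \frac{2\alpha^2 \left[E(1 - \alpha) - (2\alpha - \alpha^2) K(1 - \alpha)\right]}{\pi (1 - \alpha)^2 \left[\dfrac{\pi}{2} - \arcsin (1 - \alpha) - (2\alpha - \alpha^2)^{1/2}\right]}. \tag{42}$$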

This relative (interfering) signal-to-noise power ratio result is plotted 
in Fig. 3 and is particularly interesting since the plotted data can be 
viewed as a plot of $S_2NR_o/S_2NR_i$ versus the input interfering 
signal-to-total broadband noise power ratio $S_2N_1R_i$. That is, it is seen that 
the ratio of average input interfering-signal power to total average input 
broadband noise power is given by 

$$S_2N_1R_i \triangleq \frac{R_2(0)}{R_1(0)} = \frac{1 - \alpha}{\alpha}. \tag{43}$$

With this interpretation in mind, the plotted data show that there is a 
degradation in signal-to-noise power ratio in the signal band at all levels 
of input signal-to-noise power ratio as $\tau_1 \to 0$, and that this degradation 
increases monotonically with increasing input signal-to-total noise power 
ratio. We note the contrast of this result to that found by Davenport 
for the case in which the limiter is driven by an unmodulated sinusoid 
plus narrowband Gaussian noise, where he shows that there is an 
enhancement in signal-to-noise ratio (measured in the narrow noise band) 
at high input signal-to-noise ratios. It is also noted that the data plotted 
in Fig. 5 together with those in Fig. 3 show that, although the degradation 
increases monotonically with $S_2N_1R_i$, it does not increase as rapidly as 
$W\tau_1 (S_2NR_o)$ itself is increasing. 
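
The monotonic behavior described above can be examined by evaluating equation (42) as a function of the input ratio of equation (43). The Python sketch below is an informal evaluation under stated assumptions: $K$ and $E$ are taken with modulus $1 - \alpha$ and computed by the arithmetic-geometric mean, which is not a method used in the original text; the function names are illustrative.

```python
import math


def ellip_ke(k):
    """Complete elliptic integrals K(k) and E(k), modulus k (AGM method)."""
    a, b = 1.0, math.sqrt(1.0 - k * k)
    s, weight = 0.5 * k * k, 1.0
    for _ in range(40):
        a_next, c = 0.5 * (a + b), 0.5 * (a - b)
        b = math.sqrt(a * b)
        a = a_next
        s += weight * c * c
        weight *= 2.0
    big_k = math.pi / (2.0 * a)
    return big_k, big_k * (1.0 - s)


def input_interference_to_noise(alpha):
    """Equation (43): R_2(0) / R_1(0) = (1 - alpha) / alpha."""
    return (1.0 - alpha) / alpha


def relative_interference_snr(alpha):
    """Right-hand side of equation (42), valid for 0 < alpha < 1."""
    big_k, big_e = ellip_ke(1.0 - alpha)
    bracket = math.pi / 2.0 - math.asin(1.0 - alpha) - math.sqrt(2.0 * alpha - alpha ** 2)
    num = 2.0 * alpha ** 2 * (big_e - (2.0 * alpha - alpha ** 2) * big_k)
    return num / (math.pi * (1.0 - alpha) ** 2 * bracket)


if __name__ == "__main__":
    for alpha in (0.9, 0.7, 0.5, 0.3, 0.1):
        ratio_db = 10.0 * math.log10(relative_interference_snr(alpha))
        print(input_interference_to_noise(alpha), ratio_db)
```

As $\alpha$ approaches unity (weak interference) the expression tends to $1/(\pi - 2)$, the same value that equation (30) gives for a weak signal in broadband noise alone.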


IV. CONCLUSIONS 


This paper has concentrated on analyzing the power spectral density 
at the output of an ideal limiter when the input is driven by a narrow- 
band gaussian signal plus an additive gaussian noise that consists of a 
broadband background component plus a narrowband interference. 
Conclusions that can be drawn from this work depend upon the system 
in which the limiter is used, and one is led to the following conclusions 
when this system consists of a spectrum analyzer preceded by the ideal 
limiter: Spectrum analyzer performance will be degraded by the presence 
of the limiter, and this degradation can be substantial when there is a 
strong narrowband interfering signal present at the limiter input. This 
intuitive conclusion follows from the fact that the signal-to-noise power 
ratio SNR measured in the signal band may be significantly degraded by 
the presence of the limiter when there is a strong narrowband interfering 
signal present at the limiter input, plus the fact that intermodulation 
products of the narrowband signal with the narrowband interference 
may be troublesome as indicated by a decreased signal-to-image power 
ratio SIR. 

However, it is important to note that the results also indicate that the 
degradation in performance can be minimized by making the band- 
width observed by the limiter sufficiently wide that the average broad- 
band noise power dominates both the signal and interference powers. 
This conclusion follows from the fact that such a procedure minimizes 
both the degradation in SNR and the decrease in SIR mentioned above 
since it ultimately requires that $\alpha$ approach unity. In particular, the 
data plotted in Fig. 3 show that the signal-to-noise power ratio SNR is 
degraded by less than about 1.3 dB as long as the total average broad- 
band noise power is greater than the average narrowband interference 
power. In addition, the data plotted in Fig. 4 show that the signal-to- 
image power ratio SIR is greater than about 14.5 dB as long as the 
total average broadband noise power is greater than the average narrow- 
band interference power. This SIR result is interesting since it is indic- 
ative of the fact that intermodulation products do not grow as rapidly 
with increasing interfering-signal power in the situation analyzed here 
as they do when the ideal limiter is driven by two sinusoids plus 
narrowband Gaussian noise. This conclusion follows from comparison of Fig. 
4 with the results of Jones as presented in his Fig. 4. The difference 
in behavior appears to be due primarily to the fact that the strong 
narrowband signal in this analysis is a gaussian process and not a 
sinusoid. 

It is of course true that the conclusions reached above based on the 
data plotted in Fig. 3 are conclusions based on the assumption that 
the broadband covariance function $R_1(\tau)$ is the triangular function 
specified in equation (20). This example was chosen as a typical 
example that is computationally convenient for studying the degradation 
in signal-to-noise power ratio SNR as a function of interfering-signal 
strength. It is also of interest to study the dependence of the 
degradation in SNR on the choice of $R_1(\tau)$, and it is noted that this can be 
accomplished by using $S_{Y1}(f)$ given by equation (14) instead of $S_{Y1}(f)$ 
given by equation (21) in the calculation of $SNR_o/SNR_i$. 

Finally, it is emphasized that the results leading to the above 
conclusions are asymptotic results that apply when the broadband noise 
correlation time $\tau_1$ approaches zero. As discussed in Section I, our 
interest in small $\tau_1$ stems from a desire to model the situation in which 
the average noise power in the spectral band occupied by the 
narrowband signal may be comparable to the average signal power but in 
which the total average noise power is much larger than the average 
signal power by virtue of the large noise bandwidth observed by the 
limiter. Thus we have a practical interest in the situation of small $\tau_1$, 
although it is of course true that the situation of engineering importance 
is that in which $\tau_1$ although small is greater than zero; for example, 
$\alpha < 1$ makes physical sense only if $\tau_1 > 0$. With this in mind, it is of 
interest to determine the conditions that must be satisfied for the 
results of this work to be useful when $\tau_1 > 0$, and inspection of the 
analysis performed leads to the following conclusions (when the 
broadband noise covariance function $R_1(\tau)$ is written such that the 
bandwidth of the broadband noise is approximately $\tau_1^{-1}$): In order for the 
power spectral density result given by equation (11) and the 
signal-to-image power ratio result plotted in Fig. 4 to remain useful, it is 
necessary that certain conditions be satisfied: 


(i) The broadband noise correlation time must itself satisfy the 
condition $\tau_1 \ll 1$. 
(ii) The input signal-to-noise power ratio 

$$\eta_S \triangleq \frac{R_S(0)}{R_N(0)} = \alpha \tau_1 \frac{R_S(0)}{C_0} \tag{10}$$

must satisfy the condition $\eta_S \ll 1$. 

In addition to these conditions, in order for the power spectral density 
results given by (13) and (17) and the signal-to-noise power ratio results 
plotted in Figs. 3 and 5 to remain useful, it is necessary that the condition 

$$\omega_i \tau_1 \ll 1, \qquad i = 0, 1, 2, \tag{44}$$

be satisfied. This last condition requires that the bandwidth of the 
broadband background noise be much larger than the largest of the center 
frequencies $\omega_0$, $\omega_1$, and $\omega_2$. The necessity of this condition was noted in 
Section I, and it was pointed out that this condition is not satisfied in 
communications situations in which the bandwidth of the broadband 
noise is much larger than that of the narrowband signals that may be 
present but much smaller than their center frequencies. However, in- 
spection of the derivation of equations (13) and (17) shows that, if we set 


Wo = WO = Oo/7, and = w. = &/m +, , (45) 


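then we have constructed a model for these "narrowband" 
communications situations for which equations (13) and (17) hold except for the 
term $S_{Y1}(f)$, which is now given by 

$$S_{Y1}(f) = \frac{4}{\pi} \int_0^\infty \left\{ \arcsin\left[\alpha\rho_1(\tau) + (1 - \alpha)\rho_2(\tau) \cos \omega_2\tau\right] - \arcsin\left[(1 - \alpha)\rho_2(\tau) \cos \omega_2\tau\right] \right\} \cos \omega\tau \, d\tau. \tag{46}$$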


Signal-to-noise power ratio results corresponding to those plotted in 
Figs. 3 and 5 can be calculated (numerically) using equation (17) with 
$S_{Y1}(f)$ given by equation (46) after making the simplifications that 
follow from the definitions of $\omega_0$, $\omega_1$, and $\omega_2$ given in equation (45). When 
a = 1 and & is large, the signal-to-noise ratio result corresponding to 
Fig. 3 will reduce to the result derived by Manasse, and others (Ref. 10). 
APPENDIX A 


Calculation of Output Power Spectral Density 


Using the characteristic function method discussed by Rice [Ref. 17] 
it can be shown [Ref. 12, p. 308] that, if the input to the ideal limiter of 
Fig. 1 is given by equation (1), then the autocorrelation function at 
the limiter output 

$$R_Y(\tau) = \langle Y(t) Y^*(t - \tau) \rangle_{av} \tag{47}$$

is given by equation (7). Defining the input signal-to-noise power ratio 
$\eta_S$ according to equation (10), it follows that 


oe ee 


Ry(r) 


k=0 m=0 ak! m! 
T_ps(2) || pw(1) ii 
| este Pee (ce Oat 
= @, otherwise. (48) 


It was pointed out in the text that we are interested in the situation 
where $\eta_S$ is small due to the large bandwidth of the broadband 
background noise. Motivated by this, it is noted that, upon summing on 
$m$, equation (48) can be written as 


Ri) = A(t) 


k=0(even) 7 


7K! 
| pqttt ELL, 8. [alo }3 py(7) Fee ls 
sale ee kee Bee ae 


tes 2) 292 Ae aed (eee 
Noting that ${}_2F_1(a, b; c; z)$ is finite for all $|z| < 1$ as long as $c \neq m$ 
($m = 0, -1, -2, \ldots$)† [Ref. 13, p. 556], it follows that 











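† Gauss's hypergeometric function is also absolutely convergent at $|z| = 1$ as 
long as $\operatorname{Re}(c - a - b) > 0$. Thus in fact 

$${}_2F_1\left(\tfrac{1}{2}, \tfrac{1}{2}; \tfrac{3}{2}; 1\right) = \frac{\pi}{2},$$

which implies that the series 

$$\arcsin x = x + \frac{1}{2} \frac{x^3}{3} + \frac{1 \cdot 3}{2 \cdot 4} \frac{x^5}{5} + \cdots$$

converges for all $|x| \le 1$. 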
_2 M8, (eal) eu 
Ry) = 2{ar,|3 358; 1+ ns 1+ ns 


o}l 1.1. (ew) \" | _es(r)_ 
rs cee a eee 
+ Olnses(r)ew(7)] + Olnses(7)] (50) 


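as $\eta_S \to 0$, for all $\tau$ such that $|\rho_N(\tau)| < 1$. Moreover, by expressing the 
hypergeometric function in the first term of equation (50) in its series 
form and then appropriately collecting terms, it can be shown that 

$$\frac{\rho_N(\tau)}{1 + \eta_S} \, {}_2F_1\left[\tfrac{1}{2}, \tfrac{1}{2}; \tfrac{3}{2}; \left(\frac{\rho_N(\tau)}{1 + \eta_S}\right)^2\right] = \arcsin \rho_N(\tau) - \eta_S \rho_N(\tau) \left[1 - \rho_N^2(\tau)\right]^{-1/2} + O\left[\eta_S^2 \rho_N(\tau)\right] \tag{51}$$

as $\eta_S \to 0$, for all $\tau$ such that $|\rho_N(\tau)| < 1$. Also, it is immediately 
recognized that, in the second term in equation (50), 

$$\frac{\eta_S \rho_S(\tau)}{1 + \eta_S} \, {}_2F_1\left[\tfrac{1}{2}, \tfrac{1}{2}; \tfrac{1}{2}; \left(\frac{\rho_N(\tau)}{1 + \eta_S}\right)^2\right] = \eta_S \rho_S(\tau) \left[1 - \rho_N^2(\tau)\right]^{-1/2} + O\left[\eta_S^2 \rho_S(\tau)\right] \tag{52}$$

as $\eta_S \to 0$, for all $\tau$ such that $|\rho_N(\tau)| < 1$. Therefore, recalling that the 
noise $N(t)$ contains a broadband component so that in fact 

$$|\rho_N(\tau)| < 1 \tag{53}$$

for all $|\tau| > 0$,† it is concluded upon substitution of equations (51) 
and (52) into equation (50) that 

$$R_Y(\tau) = \frac{2}{\pi} \left\{ \arcsin \rho_N(\tau) + \eta_S \left[\rho_S(\tau) - \rho_N(\tau)\right] \left[1 - \rho_N^2(\tau)\right]^{-1/2} \right\} + O\left[\eta_S^2 \rho_S(\tau)\right] + O\left[\eta_S^2 \rho_N(\tau)\right] \tag{54}$$

as $\eta_S \to 0$, for all $|\tau| > 0$. 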
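
The weak-signal expansion in equation (54) can be compared numerically with the exact arcsine expression of equation (56). The Python sketch below is an informal check under stated assumptions: the correlation coefficients are treated as fixed numbers at a single lag with $|\rho_N| < 1$, and the remainder is expected to shrink as the square of $\eta_S$; the function names are illustrative.

```python
import math


def exact_output_correlation(rho_n, rho_s, eta):
    """Equation (56): arcsine relation for signal plus noise."""
    return (2.0 / math.pi) * math.asin((rho_n + eta * rho_s) / (1.0 + eta))


def first_order_output_correlation(rho_n, rho_s, eta):
    """Equation (54) without the remainder terms (requires |rho_n| < 1)."""
    slope = (rho_s - rho_n) / math.sqrt(1.0 - rho_n * rho_n)
    return (2.0 / math.pi) * (math.asin(rho_n) + eta * slope)


def expansion_error(rho_n, rho_s, eta):
    """Magnitude of the difference between equations (56) and (54)."""
    return abs(exact_output_correlation(rho_n, rho_s, eta)
               - first_order_output_correlation(rho_n, rho_s, eta))


if __name__ == "__main__":
    for eta in (1e-1, 1e-2, 1e-3):
        print(eta, expansion_error(0.3, 0.8, eta))
```

Reducing $\eta_S$ by a factor of ten should reduce the error by a factor of about one hundred, in keeping with a second-order remainder. The expansion degrades as $|\rho_N|$ approaches unity, which is the difficulty near $\tau = 0$ discussed in the text.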
In order to calculate the power spectral density $S_Y(f)$ at the limiter 
output it is necessary to evaluate 

$$S_Y(f) \triangleq 2 \int_0^\infty R_Y(\tau) \cos \omega\tau \, d\tau. \tag{55}$$





† Note that this follows from the integrability condition placed on the 
broadband covariance function $R_1(\tau)$ by (5). This integrability condition implies that 
$|\rho(x)| < \rho(0)$ for all $|x| > 0$ and requires that the power spectrum of the 
broadband noise contain no line components. 
As it stands, $R_Y(\tau)$ given by equation (54) is not enough because of the
difficulty as $\tau \to 0$. It is not clear from the foregoing analysis whether
or not the representation given by equation (54) is valid as $\tau \to 0$
when $\eta_S \to 0$, and in fact this representation may be valid for all $\rho_S(\tau)$
and $\rho_N(\tau)$ of interest [compare Ref. 18].* In any event, the difficulties
involved in evaluating the remainder terms in order to examine this
possibility can be circumvented by using the well-known result that
$R_Y(\tau)$ is also given by equation (8). Thus,


$$ R_Y(\tau) = \frac{2}{\pi}\arcsin\left[\frac{\rho_N(\tau) + \eta_S\,\rho_S(\tau)}{1 + \eta_S}\right] \tag{56} $$
which implies that 
Res = aresin aoa) (57) 


as $\eta_S \to 0$, uniformly in $\tau$. In fact, making use of the expressions for
$R_Y(\tau)$ given by equations (54) and (57) in conjunction with the expression
for $\eta_S$ given by equation (10) and the integrability condition in
equation (5), it is seen that, if R,(r) can be written in the form speci- 
fied by equation (8) and the parameters a and Rs(0)/C> satisfy the 
conditions a > 0, Rs(0)/C. < ©, then Ry(r) can be expressed:1 


Ry(r) = = aresin py(7) + o(1), Os|ri{sr7, 


2 {resin py(r) + a Hel) [es(r) — py(r7)][1 — Atala} 


+ O[rips(t)] + Olriey(7)], [rl 2n, (58) 


as $\tau_1 \to 0$. Substituting this result into equation (55) and assuming
that the integrability conditions


| wer cur ee (59) 


/ erence (60) 


are satisfied, there results 


* McFadden derives a similar expression for the case of a weak sinusoid in additive
gaussian noise and asserts that the expansion is valid at $\tau = 0$ as long as $\rho_N(\tau)$
satisfies certain differentiability conditions.

† Another method for obtaining equation (58) is to expand equation (56) in a
Taylor series about $\rho_N(\tau)$.
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Sy(f) = 4if arcsin py(7) cos wr dr 


$+ Hs 5, [taste — pu(e)IIL — ek()I? cos er ar} + om) 
(61) 


as $\tau_1 \to 0$, uniformly in $f$. This result can immediately be simplified by
observing that the predominant contributions to $S_Y(f)$ due to interaction
of the signal and noise processes are due to interaction of the signal
process with the narrowband interference component of the noise.
In fact, noting that

$$ \rho_N(\tau) = \alpha\rho_1(\tau) + (1 - \alpha)\rho_2(\tau)\cos\omega_2\tau, \tag{62} $$

it can be seen that equation (61) reduces to equation (11).
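
The arcsine relation of equation (56) and the small-$\eta_S$ expansion of equation (54) can be checked numerically. The following is only a sketch: it assumes that the limiter input is jointly gaussian with unit variance, so that its normalized covariance is $(\rho_N + \eta_S\rho_S)/(1 + \eta_S)$, and the function names are illustrative and do not come from the paper.

```python
import math
import random


def arcsine_law(rho_n, rho_s, eta_s):
    """Equation (56): limiter output covariance for a gaussian input."""
    rho_x = (rho_n + eta_s * rho_s) / (1.0 + eta_s)
    return (2.0 / math.pi) * math.asin(rho_x)


def first_order(rho_n, rho_s, eta_s):
    """Leading terms of equation (54); meaningful for small eta_s and |rho_n| < 1."""
    slope = (rho_s - rho_n) / math.sqrt(1.0 - rho_n * rho_n)
    return (2.0 / math.pi) * (math.asin(rho_n) + eta_s * slope)


def simulated_limiter_covariance(rho_x, n_samples=100_000, seed=1):
    """Monte Carlo estimate of E[sgn(x1) sgn(x2)] for gaussian x1, x2 with correlation rho_x."""
    rng = random.Random(seed)
    scale = math.sqrt(1.0 - rho_x * rho_x)
    total = 0
    for _ in range(n_samples):
        u = rng.gauss(0.0, 1.0)
        v = rng.gauss(0.0, 1.0)
        x1, x2 = u, rho_x * u + scale * v  # pair with correlation rho_x
        total += 1 if (x1 >= 0.0) == (x2 >= 0.0) else -1
    return total / n_samples
```

For small $\eta_S$ the two closed forms agree to second order in $\eta_S$, and the simulated hard-limiter covariance should fall close to the arcsine value.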


APPENDIX B 


Derivation of Output-Power Spectral Density Expansion 


It is shown in Appendix A that the output power spectral density 
can be expressed according to equation (11); namely, that 


Sy(f) = Sy,(f) ae So Hal) T1 [ [es(7) a (1 = a) po(7) COS weT] 
-[1 — (1 — a)’p3(r) cos? wr]? cos wr dr + o(7;) (63) 


as $\tau_1 \to 0$, uniformly in $f$, where


Sys(f) & 4 i Akos (1 ae eod os | oases 10D) 


Tv 
is the output power spectral density when the noise $N(t)$ alone is present
at the limiter input. $S_Y(f)$ can be put in a more useful form by expanding
both $[1 - (1 - \alpha)^2\rho_2^2(\tau)\cos^2\omega_2\tau]^{-1/2}$ and $\arcsin[\alpha\rho_1(\tau) + (1 - \alpha)\rho_2(\tau)\cos\omega_2\tau]$.
Proceeding with expansion of the latter it is seen that [Ref.
13, item 15.1.6]


aresin [ap;(7) + (1 — @)p2(r) cos w27] 

1 = re 3 2m+1 
BF POD gi lous) + = 0) ps(r) 008 er 
1S Mmt) SR Omt+D! 

Qn? p> Tim + 3)m! d, (2m + 1 — 7)!7! 


ACh a) p2(7) Cos wet] [op (7) PP"? 


I 
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ey T? x 
= aresin [(1 — a@)p.(7) cos wT] + 5. a xu Te Del 


Se Om+D! og ne 
> Omit b = gly) [2 — a)pa(z) cos w27}' [o-p1(7)] . (65) 
Thus, substituting equation (65) into equation (64), we have 








$$ S_{Y_N}(f) = S_{Y_1}(f) + S_{Y_2}(f), \tag{66} $$
where
ee is +3) S3__@m+)! 
SO =a], 2 coe + mi &Om+1— pia 
-[(1 — @)po(r) cos wot)’ [api(7)]?"**? cos wr dr (67) 
and 


Sy, ff) = : r aresin [(1 — a)p2(7) cos w.7] cos wr dr. (68) 


We have succeeded in breaking $S_{Y_N}(f)$ into a broadband component
$S_{Y_1}(f)$ plus a component $S_{Y_2}(f)$ consisting of narrowband contributions.
In fact, letting $x \triangleq \tau/\tau_1$, it can be seen, using the integrability condition
of equation (5), that


iA [(1 — a)po(r) cos wet)’ [apy (7)]?"*** cos wr dr 


I 


jee 


ioe] 
1 / [(1 — @) po(71%) COS W712)’ [ap(x) cos 1712 coswr,x dx 
0 


1 i (1 sot wo pe) dz 4. o(71) (69) 
0 


as $\tau_1 \to 0$, for all $f \le f_{\max} < \infty$ for arbitrary fixed $f_{\max}$, as long as $j < 2m + 1$.
Moreover, using this integrability condition plus the fact that
the series in the integrand is absolutely convergent, it can be shown that


_2 f° A Pte Bmp! 
80 = | Lae bmi Um FT = He 


‘(1 — @)'[op(a)}""! da + o(7:) (70) 


as $\tau_1 \to 0$, for all $f \le f_{\max} < \infty$, which can be written as


Sy,(f) = a iA {arcsin [ap(x) + 1 — al 
— aresin (1 — a)} dx + o(71) (71) 
as $\tau_1 \to 0$, for all $f \le f_{\max} < \infty$ for arbitrary fixed $f_{\max}$. Thus it is seen
that the broadband component $S_{Y_1}(f)$ becomes white across any frequency
band of finite extent as $\tau_1 \to 0$ and moreover that, if $\alpha = 1$,
then $S_{Y_1}(f)$ is just the output power spectral density that would be
observed if the broadband component of the noise was present alone
at the limiter input.

Turning now to $S_{Y_2}(f)$ given by equation (68), it is seen that


arcsin [(1 — a@)po(r) Cos wo7] 


1 SAMe+4 
=a 2 Tht Del 





[((1 — a)pe(7) cos wet)" 


co 


~1 FPE+) yy ae 
~ Daf pa T(k + 3)k! ce a) p2(7)] 


; 2k + 1)! 
OE TFT HF 008 Ch + 1 — Boar, (72) 


Now, letting $k - r \triangleq m$ and then interchanging the order of summation
on $k$ and $m$, there results





M 


arcsin [(1 — a)po(7) COS wet] 


oj ee Ce Ce 
Qe? Kh AX Th + DK (ke + om + 12)! (k — m)! 2” 


-[(1 — @)po(7)]"*** cos (2m + wor. (73) 
However [Ref. 13, item 6.1.18],

$$ (2k + 1)! = (2\pi)^{-1/2}\,2^{2k+3/2}\,\Gamma(k + 1)\,\Gamma(k + \tfrac32) \tag{74} $$

so that equation (73) can be rewritten as


arcsin [(1 — a)p2(7) cos w27] 


Lae (ik +4 ok+1 
= - p> pz (k af ws 1)! = m)! (1 = a) p2(7)| cos (2m F I)wer 


1 = = iy ] 3 27+2m+1 . 
-i> dy ne SERS — dati?" cos @m + Dowt 
1ST 3 i i 2 
= Remy alm + Bm + 4 2m + 25 — well 


-[(1 — a)po(7)]°"** cos (2m + 1)wor. (75) 


Substituting this result into equation (68), we obtain the result stated
in equation (15).

The expansion of $[1 - (1 - \alpha)^2\rho_2^2(\tau)\cos^2\omega_2\tau]^{-1/2}$ in the second term in
equation (63) can be pursued in a manner identical to that used above
for the expansion of $\arcsin[(1 - \alpha)\rho_2(\tau)\cos\omega_2\tau]$, and the result obtained
is that given in (18).
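
The factorial identity invoked from Ref. 13, item 6.1.18 (the duplication formula for the gamma function, written for argument $k + 1$) is easy to confirm numerically. The short check below is a sketch that assumes the form $(2k+1)! = (2\pi)^{-1/2}\,2^{2k+3/2}\,\Gamma(k+1)\,\Gamma(k+\tfrac32)$; the helper names are illustrative.

```python
import math


def factorial_side(k):
    """Left side of the identity: (2k + 1)!."""
    return float(math.factorial(2 * k + 1))


def gamma_side(k):
    """Right side: (2*pi)^(-1/2) * 2^(2k + 3/2) * Gamma(k + 1) * Gamma(k + 3/2)."""
    return ((2.0 * math.pi) ** -0.5) * (2.0 ** (2 * k + 1.5)) \
        * math.gamma(k + 1.0) * math.gamma(k + 1.5)
```

Comparing the two sides for the first few values of $k$ gives agreement to floating-point precision.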

It is pointed out in the text that the assumption $\rho_2(\tau) = 1$ greatly
simplifies the expression for $S_Y(f)$ without obscuring the most important
effects that result from the presence of narrowband interference. In
particular, it is seen that the assumption $\rho_2(\tau) = 1$ violates the
integrability condition in equation (60). As a result, equation (13) does
not hold uniformly in $f$ under this assumption since the points $f = kf_2$,
$k = 1, 3, \cdots$, must be excluded. However, it is observed that equation
(13) can be made to hold at these points as $\tau_1 \to 0$ by addition of the
remainder term

$$ o\!\left(\tau_1\int_0^\infty |\rho_2(\tau)|\,d\tau\right). \tag{76} $$

Moreover, it is seen from equation (15) that, when $\rho_2(\tau) = 1$, $S_{Y_2}(f)$ is
nonzero only at $f = kf_2$, $k = 1, 3, \cdots$, and its value at these points is

$$ o\!\left(\int_0^\infty |\rho_2(\tau)|\,d\tau\right). \tag{77} $$

Thus in fact it can be seen that, when $\rho_2(\tau) = 1$, it is meaningful to
write $S_Y(f)$ as given by equation (17).
Rate-Distortion Functions for
Gaussian Markov Processes* 


By BARRY J. BUNIN 
(Manuscript received June 5, 1969) 


The rate-distortion function with a mean square error distortion criterion
is investigated for a class of Gaussian Markov sources. It is found that for
rates greater than a certain minimum, the rate-distortion function is
equivalent to that of an independent letter source. This minimum rate was found
to be less than n bits per symbol, where n is the order of the Markov
sequence. Comparisons between the rate-distortion function and two
quantizing systems are made.


I. INTRODUCTION


Suppose in the communication system of Fig. 1, the source emits a 
sequence of continuous-valued random variables. The exact specifica- 
tion of such variates requires an infinite number of binary digits. Hence 
exact transmission would require a channel of infinite capacity. Since 
no physical channels possess infinite capacity, we see that exact trans- 
mission is impossible through this system. 

However, if we are willing to accept some error in our specification 
of the source output, then finitely many binary digits are necessary. 
In the study of digital encoding systems, a useful quantity to know is 
the fewest number of binary digits necessary to represent an analog 
signal within a certain error. Such a quantity would give us a perform- 
ance criterion with which to compare existing systems, and also tell us 
how much improvement is possible. 

The quantity we seek is given by Shannon’s rate-distortion function." 
The rate-distortion function gives, for any bit rate, the minimum pos- 
sible error achievable. 


In this paper we study the rate-distortion functions for the important
class of gaussian Markov sources. We measure our error by the mean
square error criterion. Also, the performance of two quantizing systems,
differential PCM and block quantizing, is compared to the rate-distortion
bound.

* This research was partially supported by the Air Force Office of Scientific
Research under Contract AF 49(638)-1600. This paper is part of a dissertation
submitted in 1969 to the Faculty of the Polytechnic Institute of Brooklyn, in partial
fulfillment of the requirements for the Ph.D. degree in systems science.

Fig. 1 — General communication system (source, encoder, channel, decoder, receiver).


II. DISCUSSION OF RESULTS 


We have studied the rate-distortion functions of gaussian Markov
sources with a mean square error criterion. We express our results in
Fig. 2 by plotting signal-to-noise ratio in dB versus bit rate $R$. The
signal-to-noise ratio is given by

$$ S/N = 10\log_{10}\frac{\sigma^2}{D} \tag{1} $$

where $\sigma^2$ is the variance of the source output, and $D$ is the mean square
error.

It was found that for rates $R$ greater than a certain $R_{\min}$, the rate
distortion function is given by

$$ R = \tfrac12\log_2\frac{\sigma_m^2}{D}, \qquad 0 < D \le \sigma_m^2 \tag{2} $$

or

$$ S/N = 6.02R + 10\log_{10}\frac{\sigma^2}{\sigma_m^2} \tag{3} $$

where $\sigma_m^2$ is the minimum mean square prediction error one step ahead.
The point $R_{\min}$ occurs in the interval $(0, n)$ where $n$ is the order of the
Markov process that the source emits. The exact location of $R_{\min}$ depends
on the exact shape of the power spectral density of the process,
as we shall see. At $R = R_{\min}$, the rate-distortion function has a
discontinuity in the third derivative.

Fig. 2 — Rate-distortion bound of a Markov-n source compared with block quantizing
system and differential PCM. (Axes: S/N in dB versus $R$ in bits/symbol; the curve for
differential PCM and block quantizing lies 4.34 dB below the bound; $R_{\min}$ and $n$ are
marked on the rate axis.)

If the source were followed by the optimum prediction system of Fig.
3 then the output sequence produced would be uncorrelated with variance
$\sigma_m^2$. Such a sequence has the rate-distortion function given by (2).
Hence for rates greater than $R_{\min}$ the sequences at the input and output
of the prediction system have equal rate-distortion functions. For rates
less than $R_{\min}$ they do not.

A lower bound on the performance achievable by the block quantizing 
system of Fig. 4 was found. The result is also shown in Fig. 2, where it 
is seen that this system can be made to perform within 4.34 dB of the 
bound. 

Also shown in Fig. 2 is the performance bound for a differential PCM
system (see Fig. 5) as derived by O'Neal. This bound, however, holds
only for high bit rates.


III. RATE DISTORTION FUNCTIONS FOR MARKOV-N SOURCES 


3.1 Introduction 


Fig. 3 — Predictive communication system.

Fig. 4 — Block quantizer for correlated source.

Consider again the communication system of Fig. 1. The source emits
the discrete time, stationary random process $x_t$, $t = 0, \pm1, \pm2, \cdots$.
After $N$ seconds, a column $N$-vector $X$ is obtained, and after encoding,
transmission and decoding, the receiver obtains a replica $\hat X$ of $X$. The
mean square error between the transmitted and received vectors is
defined by

$$ D = \frac{1}{N}E[(X - \hat X)^T(X - \hat X)] \tag{4} $$

where $E$ denotes expectation and $X^T$ is the transpose of $X$. It is reasonable
to ask what the minimum bit rate is, at which we must transmit,
so as to be able to achieve a mean square error less than some prescribed
amount. The answer is given by Shannon's rate-distortion function
which is defined as follows:


$$ R(D) = \lim_{N\to\infty}\min\frac{1}{N}\iint p(X_N)\,p(\hat X_N \mid X_N)\,\log_2\frac{p(\hat X_N \mid X_N)}{p(\hat X_N)}\,dX_N\,d\hat X_N \tag{5} $$

where the minimization is taken over all $p(\hat X_N \mid X_N)$ satisfying

$$ \frac{1}{N}\iint (X_N - \hat X_N)^T(X_N - \hat X_N)\,p(X_N)\,p(\hat X_N \mid X_N)\,dX_N\,d\hat X_N \le D \tag{6} $$

and where

$p(X_N)$ = probability measure of the source vector $X_N$
$p(\hat X_N \mid X_N)$ = conditional probability measure of $\hat X_N$ given $X_N$
$p(\hat X_N)$ = probability measure induced on $\hat X_N$ by $p(X_N)$ and $p(\hat X_N \mid X_N)$.

(The subscript $N$ is included to emphasize that we are dealing with an
$N$-vector.)

Fig. 5 — Differential pulse code modulation system.

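Suppose the source emits a stationary gaussian time series with
correlations $E(x_i x_j) = r_{i-j} = r_k$. Then the discrete time power spectral
density is given by

$$ f(\lambda) = \sum_{k=-\infty}^{\infty} r_k e^{-ik\lambda}, \qquad -\pi \le \lambda \le \pi \tag{7} $$

and the rate distortion function is given parametrically by (see Fig. 6
for interpretation)

$$ R(\phi) = \frac{1}{4\pi}\int_A \log_2\frac{f(\lambda)}{\phi}\,d\lambda \tag{8a} $$

$$ D(\phi) = \frac{1}{2\pi}\left[\int_A \phi\,d\lambda + \int_{A'} f(\lambda)\,d\lambda\right] \tag{8b} $$

where

$$ A = \{\lambda : f(\lambda) \ge \phi\}, \qquad A' = \{\lambda : f(\lambda) < \phi\} $$

and $A \cup A' = (-\pi, \pi)$.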

Fig. 6 — Graphical interpretation of equations (8a) and (8b). The set
$A = (-\pi, \lambda_{-4}) \cup (\lambda_{-3}, \lambda_{-2}) \cup (\lambda_{-1}, \lambda_1) \cup (\lambda_2, \lambda_3) \cup (\lambda_4, \pi)$.
$A' = (\lambda_{-4}, \lambda_{-3}) \cup (\lambda_{-2}, \lambda_{-1}) \cup (\lambda_1, \lambda_2) \cup (\lambda_3, \lambda_4)$.

Hence, if we are given a distortion $D$, from (8b) we can find $\phi$, and
then from (8a) we can find the theoretically minimum rate $R$ necessary
to achieve a mean square error less than or equal to $D$. If $\{x_t\}$ consists
of independent Gaussian variates, with variance $\sigma^2$, then $f(\lambda) = \sigma^2$ and
(8a) becomes

$$ R(D) = \tfrac12\log_2\frac{\sigma^2}{D} \text{ bits/symbol.} \tag{9} $$
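
The parametric pair (8a) and (8b) can be evaluated numerically for any spectrum. The sketch below uses midpoint integration over $(-\pi, \pi)$; the function names, the grid size and the example spectrum $K/\prod_j |e^{i\lambda} - a_j|^2$ with illustrative coefficients are assumptions made for this example and are not taken from the paper.

```python
import math


def markov_spectrum(lam, a_list, K=1.0):
    """K / prod_j |e^{i*lam} - a_j|^2, using |e^{i*lam} - a|^2 = 1 - 2a cos(lam) + a^2."""
    denom = 1.0
    for a in a_list:
        denom *= 1.0 - 2.0 * a * math.cos(lam) + a * a
    return K / denom


def rate_distortion_point(f, phi, n_grid=20000):
    """Return (R, D) from (8a) and (8b) for the level phi."""
    h = 2.0 * math.pi / n_grid
    rate = 0.0
    dist = 0.0
    for i in range(n_grid):
        lam = -math.pi + (i + 0.5) * h
        s = f(lam)
        if s >= phi:                      # lam belongs to A
            rate += math.log2(s / phi) * h
            dist += phi * h
        else:                             # lam belongs to A'
            dist += s * h
    return rate / (4.0 * math.pi), dist / (2.0 * math.pi)
```

For a flat spectrum $f(\lambda) = \sigma^2$ the routine reproduces (9), and sweeping `phi` from small to large values traces out the whole $R(D)$ curve.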


If we restrict the class of sources to be wide sense Markov of order
$n$, then $f(\lambda)$ assumes the following form:

$$ f(\lambda) = \frac{K}{\prod_{j=1}^{n}|e^{i\lambda} - a_j|^2} \tag{10} $$

with $0 < a_j < 1$, $a_j \ne a_k$ if $j \ne k$, and $K$ is chosen to satisfy

$$ \sigma^2 = E\{x_t^2\} = \frac{1}{2\pi}\int_{-\pi}^{\pi} f(\lambda)\,d\lambda. \tag{11} $$

In the remainder of this paper we consider some properties of the
rate distortion function as given by (8a) and (8b) for processes with
power spectral density (10).*

3.2 The Markov-n Sequence 


In this section we present some results from prediction theory. 
For details and proofs see Refs. 6 and 7. 
A process with power spectral density given in (10) is known as a 
Markov-n process.’ Performing the indicated multiplication in (10) 
results in 


A sequence with the spectrum (12) can be shown to satisfy the
autoregressive relation

$$ x_t + \sum_{i=1}^{n} b_i x_{t-i} = \varepsilon_t \tag{13} $$

where $\{\varepsilon_t\}$ is a sequence of uncorrelated random variables with variance $K$.
Writing (13) in the form

$$ x_t = -\sum_{i=1}^{n} b_i x_{t-i} + \varepsilon_t \tag{14} $$

it can be shown by the orthogonality principle (Ref. 8, Section VII-C)
that the best linear predictor in the mean square sense, of $x_t$ given the
infinite past is just

$$ \hat x_t = -\sum_{i=1}^{n} b_i x_{t-i}. \tag{15} $$

Hence for a Markov-n process the best prediction involves only the
$n$ previous samples.
The error is

$$ e_t = x_t - \hat x_t = \varepsilon_t. \tag{16} $$

The minimum mean square error is thus

$$ \sigma_m^2 = E(e_t)^2 = K. \tag{17} $$

From (10) and (17)

$$ \frac{1}{2\pi}\int_{-\pi}^{\pi}\log_2 f(\lambda)\,d\lambda = \log_2\sigma_m^2 - \frac{1}{2\pi}\sum_{j=1}^{n}\int_{-\pi}^{\pi}\log_2|e^{i\lambda} - a_j|^2\,d\lambda. \tag{18} $$

From Peirce's tables, number 540, it can be shown that the integral is
zero (recalling that $0 < a_j < 1$). We state our conclusion as a theorem.

* T. Berger, in a recent paper, considers similar properties for the Wiener process.


Theorem 1: For a sequence with spectrum given in (10) the minimum
mean square error resulting from an optimal prediction one step ahead is
$\sigma_m^2$, where

$$ \log_2\sigma_m^2 = \frac{1}{2\pi}\int_{-\pi}^{\pi}\log_2 f(\lambda)\,d\lambda. \tag{19} $$

Theorem 1 is a special case of the theorem proved in Ref. 6, page 188.
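
Theorem 1 lends itself to a quick numerical check: averaging $\log_2 f(\lambda)$ over $(-\pi, \pi)$ for a spectrum of the form (10) should return $\log_2 K$, because each term $\log_2|e^{i\lambda} - a_j|^2$ integrates to zero. The sketch below assumes illustrative values for the $a_j$ and $K$; the function names are not from the paper.

```python
import math


def mean_log2_spectrum(a_list, K, n_grid=4096):
    """(1/2pi) * integral over (-pi, pi) of log2 f(lam), by the midpoint rule."""
    h = 2.0 * math.pi / n_grid
    total = 0.0
    for i in range(n_grid):
        lam = -math.pi + (i + 0.5) * h
        denom = 1.0
        for a in a_list:
            denom *= 1.0 - 2.0 * a * math.cos(lam) + a * a  # |e^{i*lam} - a|^2
        total += math.log2(K / denom) * h
    return total / (2.0 * math.pi)


def one_step_prediction_error(a_list, K):
    """sigma_m^2 obtained from the spectrum through (19)."""
    return 2.0 ** mean_log2_spectrum(a_list, K)
```

With, say, `a_list = [0.5, 0.8]` and `K = 3.0`, the computed prediction error agrees with $K$ to numerical precision, as (17) requires.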


3.3 Evaluation of $R(D)$ for $D \le f(\pi)$

We next consider the particular form that equations (8a) and (8b)
assume when $f(\lambda)$ is as given in (10).
Theorem 2: Given a process with

$$ f(\lambda) = \frac{K}{\prod_{j=1}^{n}|e^{i\lambda} - a_j|^2} $$

for some integer $n$. For mean square errors satisfying $0 \le D \le f(\pi)$, $R(D)$
is given by

$$ R(D) = \tfrac12\log_2\frac{\sigma_m^2}{D} \text{ bits/symbol.} \tag{20} $$
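Proof: From (8a) and (8b)

$$ R(\phi) = \frac{1}{4\pi}\int_A \log_2\frac{f(\lambda)}{\phi}\,d\lambda $$

$$ D = \frac{1}{2\pi}\left[\int_A \phi\,d\lambda + \int_{A'} f(\lambda)\,d\lambda\right]. $$

The power spectral density $f(\lambda)$ is monotonically decreasing in $|\lambda|$ with a
minimum at $\lambda = \pi$. Hence for $\phi$ in the range $0 \le \phi \le f(\pi)$, $A = (-\pi, \pi)$,
$A' = \emptyset$, and

$$ D = \frac{1}{2\pi}\int_{-\pi}^{\pi}\phi\,d\lambda = \phi. \tag{21} $$

It follows that

$$ R(\phi) = R(D) = \frac{1}{4\pi}\int_{-\pi}^{\pi}\log_2 f(\lambda)\,d\lambda - \tfrac12\log_2 D. \tag{22} $$

From Theorem 1 the first term is $\tfrac12\log_2\sigma_m^2$ so $R(D) = \tfrac12\log_2(\sigma_m^2/D)$
which holds for $0 < D \le f(\pi)$. This is (20).

The rate-distortion function (20) is precisely the rate-distortion function
of a process consisting of independent gaussian random variables
with mean 0 and variance $\sigma_m^2$ [see (9)].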

Figure 7 illustrates why the rate-distortion function depends on
$f(\pi)$ in this way. The shape of the spectrum of $D$ in (8b) is that which
would be assumed by water if it were poured into a container shaped as
$f(\lambda)$. As we pour in water, it distributes itself uniformly so long as its
level is below $f(\pi)$. Hence $D$ is independent of $f(\lambda)$ so long as $D < f(\pi)$.
Once $D = f(\pi)$ the exact shape of $f(\lambda)$ comes into play.

Fig. 7 — Typical Markov spectrum, illustrating water filling interpretation of the
rate-distortion function.

Consider next the predictive communication system of Fig. 3. The
source emits the gaussian process with power spectral density (10). The
optimum predictor makes a prediction of $x_t$ based on $\{x_k\}_{k=-\infty}^{t-1}$. This
prediction is then subtracted from $x_t$ and the error is transmitted. The
transmitted sequence is thus the sequence $\{\varepsilon_t\}$ [see (14)] which is a
sequence of uncorrelated gaussian random variables with variance $\sigma_m^2$.
Its rate-distortion function is thus also given by (20), for $D$ in the
interval $0 < D \le \sigma_m^2$.
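From (1)

$$ S/N = 10\log_{10}\frac{\sigma^2}{D}
= 10\log_{10}\left(\frac{\sigma_m^2}{D}\cdot\frac{\sigma^2}{\sigma_m^2}\right)
= 3.01\log_2\frac{\sigma_m^2}{D} + 10\log_{10}\frac{\sigma^2}{\sigma_m^2}
= 6.02R + 10\log_{10}\frac{\sigma^2}{\sigma_m^2} \tag{23} $$

since $R$ is given by (20). Hence S/N is a linear function of $R$ over the
range of $R$ for which $0 \le D \le f(\pi)$. This range depends on $n$, the order
of the Markov process, as given in Theorem 3.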


Theorem 3: For an nth order gaussian Markov process, the rate-distortion
function is given by

$$ R(D) = \tfrac12\log_2\frac{\sigma_m^2}{D} \text{ bits/symbol} $$

for rates $R \ge R_{\min}$. The value of $R_{\min}$ depends on the exact shape of the
power spectral density $f(\lambda)$ and assumes a value satisfying

$$ 0 < R_{\min} < n \text{ bits/symbol} \tag{24} $$

depending on the $a_j$'s of $f(\lambda)$ [see (10)].
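Proof: From (10)

$$ f(\lambda) = \frac{K}{\prod_{j=1}^{n}|e^{i\lambda} - a_j|^2}. $$

From this

$$ f(\pi) = \frac{K}{\prod_{j=1}^{n}(1 + a_j)^2}. \tag{25} $$

At $D = f(\pi)$

$$ R_{\min} = R(f(\pi)) = \tfrac12\log_2\frac{\sigma_m^2}{f(\pi)} \text{ bits/symbol} \tag{26} $$

which from Theorem 1 is

$$ \tfrac12\left[\frac{1}{2\pi}\int_{-\pi}^{\pi}\log_2 f(\lambda)\,d\lambda - \log_2 f(\pi)\right]
= \tfrac12\left[\frac{1}{2\pi}\int_{-\pi}^{\pi}\log_2 K\,d\lambda - \frac{1}{2\pi}\sum_{j=1}^{n}\int_{-\pi}^{\pi}\log_2|e^{i\lambda} - a_j|^2\,d\lambda - \log_2 K + \sum_{j=1}^{n}\log_2(1 + a_j)^2\right]. \tag{27} $$

As in (18) the integral is zero and

$$ R_{\min} = \sum_{j=1}^{n}\log_2(1 + a_j) \text{ bits/symbol.} \tag{28} $$

Since $0 < a_j < 1$, each term satisfies $0 < \log_2(1 + a_j) < 1$, so that $R_{\min} < n$ bits/symbol.
Hence, $0 < R_{\min} < n$ bits/symbol, which is the desired result.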
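
As a small worked illustration of (25), (26) and (28), the following sketch computes $R_{\min}$ both from the closed form and from $\tfrac12\log_2[\sigma_m^2/f(\pi)]$ with $\sigma_m^2 = K$. The coefficient values used in the usage note are arbitrary assumptions chosen only for the example.

```python
import math


def r_min_closed_form(a_list):
    """Equation (28): sum over j of log2(1 + a_j)."""
    return sum(math.log2(1.0 + a) for a in a_list)


def r_min_from_spectrum(a_list, K=1.0):
    """Equation (26) with sigma_m^2 = K from (17) and f(pi) from (25)."""
    f_pi = K / math.prod((1.0 + a) ** 2 for a in a_list)
    return 0.5 * math.log2(K / f_pi)
```

For a second-order source with `a_list = [0.5, 0.9]`, both routes give about 1.51 bits/symbol, which lies inside the interval $(0, 2)$ predicted by (24).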


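3.4 Behavior of $R(D)$ at $D = f(\pi)$
With $f(\lambda)$ as given in (10), the rate-distortion function is, from (20)

$$ R(D) = \tfrac12\log_2\frac{\sigma_m^2}{D} $$

for $0 < D \le f(\pi)$, and from (8a) and (8b)

$$ R(\lambda) = \frac{1}{2\pi}\int_0^{\lambda}\log_2\frac{f(x)}{f(\lambda)}\,dx \tag{29a} $$

$$ D(\lambda) = \frac{1}{\pi}\left[\int_0^{\lambda} f(\lambda)\,dx + \int_{\lambda}^{\pi} f(x)\,dx\right] \tag{29b} $$

for $f(\pi) \le D \le \sigma^2$. Writing (8a) and (8b) in this form follows from the
observation that for a monotonically decreasing power spectral density
the set $A$ equals the simply connected interval $(0, \lambda)$ and $\phi = f(\lambda)$, for
the appropriate $\lambda$.

From (20)

$$ \frac{d^nR}{dD^n} = \frac{(-1)^n\,(n - 1)!}{2D^n\ln 2}, \qquad 0 < D < f(\pi) \tag{30} $$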


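and from (29)

$$ \frac{dR}{dD} = -\frac{1}{2f(\lambda)\ln 2} \tag{31} $$

$$ \frac{d^2R}{dD^2} = \frac{\pi}{2\lambda f^2(\lambda)\ln 2} \tag{32} $$

$$ \frac{d^3R}{dD^3} = -\frac{\pi^2\,[f(\lambda) + 2\lambda f'(\lambda)]}{2\lambda^3 f^3(\lambda) f'(\lambda)\ln 2} \tag{33} $$

for $f(\pi) < D < \sigma^2$, where $f'(\lambda) = df(\lambda)/d\lambda$.

From (30), (31), and (32) we see that $dR/dD$ and $d^2R/dD^2$ are continuous
at $D = f(\pi)$. But from (33) we see that $d^3R/dD^3$ is unbounded as $D \to f(\pi)$
from above (since $f'(\lambda) \to 0$ as $\lambda \to \pi$), whereas $d^3R/dD^3$ is bounded as
$D \to f(\pi)$ from below. Hence $d^3R/dD^3$ is discontinuous at $D = f(\pi)$.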


IV. QUANTIZING CORRELATED SOURCES 


4.1 Introduction 


Consider a source that emits a sequence of independent gaussian
random variables of mean 0, variance $\sigma^2$. It is desired to optimally
quantize the source by using an $M$ level quantizer. Max has shown that by
optimally choosing the quantizer input ranges and output levels, a mean
square quantization error of

$$ D_q = K(M)\,\frac{\sigma^2}{M^2} \tag{34} $$

can be achieved where $K(M)$ is a function of $M$. Further, it is shown
numerically that $K(M) \le 2.72$, and that the inequality becomes an
equality as $M \to \infty$. Hence for any $M$

$$ D_q \le 2.72\,\frac{\sigma^2}{M^2}. \tag{35} $$

For an $M$ level quantizer the number of bits/symbol is $R = \log_2 M$, so
that (35) can be written

$$ D_q \le 2.72\,\frac{\sigma^2}{2^{2R}}. \tag{36} $$

The rate-distortion function of the process is from (9)

$$ R = \tfrac12\log_2\frac{\sigma^2}{D} $$

so that the minimum possible mean square error achievable with a fixed
bit rate $R$ is

$$ D_{\min} = \frac{\sigma^2}{2^{2R}}. \tag{37} $$

Hence Max's scheme can be made to achieve a mean square error
satisfying

$$ D_q \le 2.72\,D_{\min} \tag{38} $$

where $D_{\min}$ is the minimum mean square error as given by rate-distortion
theory.
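
The factor 2.72 in (38) translates directly into the 4.34 dB offset quoted for Fig. 2, and the slope of roughly 6.02 dB per bit follows from $R = \log_2 M$. The sketch below simply evaluates these numbers; the constant 2.72 is the value stated in the text, and the function names are illustrative.

```python
import math

K_MAX = 2.72  # upper bound on K(M) quoted in the text


def d_min(sigma2, rate):
    """Equation (37): smallest mean square error at `rate` bits/symbol."""
    return sigma2 / 2.0 ** (2.0 * rate)


def quantizer_bound(sigma2, rate):
    """Equation (36): bound on the quantizer error at `rate` bits/symbol."""
    return K_MAX * sigma2 / 2.0 ** (2.0 * rate)


def gap_db():
    """Distance in dB between the bound (36) and the rate-distortion limit (37)."""
    return 10.0 * math.log10(K_MAX)


def db_per_bit():
    """Increase of S/N in dB for each additional bit per symbol."""
    return 20.0 * math.log10(2.0)
```

Evaluating `gap_db()` gives about 4.35 dB, which the text quotes as 4.34 dB, and `db_per_bit()` gives about 6.02 dB.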

In this section we find a bound on a quantizing system studied by
Huang and Schultheiss. Our result is that (38) holds also for correlated
sources, when $D_{\min}$ is as given by the appropriate rate-distortion
function. For the case of Markov sources we plot this result in Fig. 2.


4.2 Description of the System 


Referring to Fig. 4, the source emits correlated gaussian variates (not
necessarily Markov), of mean 0 and with correlation matrix $\Phi = E(XX^T)$.
The operator $A$ accumulates source $N$-vectors $X$, and rotates
them in such a way that

$$ Y = AX \tag{39} $$

and

$$ E(YY^T) = E(AXX^TA^T) = A\,E(XX^T)\,A^T = A\Phi A^T = J \tag{40} $$

where $J$ is a diagonal matrix whose $i$th entry is $\lambda_i$, the $i$th eigenvalue
of $\Phi$. Hence $Y$ is an $N$-vector whose components are independent random
variables with mean 0 and variance $\lambda_i$, and $A$ is a unitary
transformation.

The sequence of independent variates $\{y_j\}$ (the components of $Y_N$)
are then quantized step by step. The $j$th quantization can be optimized
to produce a mean square error of

$$ E(y_j - y_j')^2 = K(M_j)\,\frac{\lambda_j}{M_j^2} \le 2.72\,\frac{\lambda_j}{M_j^2} \tag{41} $$

where $M_j$ is the number of quantization levels used to quantize $y_j$.


Denoting the output of the quantizer by the vector $Y'$, the average
mean square error is

$$ D_q = \frac{1}{N}E(Y - Y')^T(Y - Y') = \frac{1}{N}E(Y - Y')^TAA^T(Y - Y')
= \frac{1}{N}E(X - \hat X)^T(X - \hat X) = D \tag{42} $$

where we have used the fact that for a unitary transformation $AA^T = AA^{-1} = I$,
the identity matrix. Hence the system mean square error
equals the quantizer mean square error.

From (41) and (42)

$$ D = \frac{1}{N}E(Y - Y')^T(Y - Y') = \frac{1}{N}\sum_{i=1}^{N}E(y_i - y_i')^2 \le \frac{2.72}{N}\sum_{i=1}^{N}\frac{\lambda_i}{M_i^2}. \tag{43} $$

4.3 Optimization over the M; 


We next tighten the upper bound by optimally choosing the $M_j$'s
subject to the following constraints.

(i) $M_j \ge 1$ for every $j$. The quantizer must have at least one output
level.

(ii) The bit rate is limited by the channel capacity, $C$ bits per symbol.
We can thus use $M = 2^C$ levels per symbol or $M^N$ levels per vector.
This implies the constraint

$$ M^N = \prod_{i=1}^{N} M_i. \tag{44} $$

Hence we wish to minimize the right side of (43) subject to (44), while
keeping in mind constraint (i).
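With $\nu$ a Lagrange multiplier, we form

$$ F = D_q + \nu\prod_{i=1}^{N} M_i. \tag{45} $$

A differentiation with respect to $M_k$ yields

$$ \frac{\lambda_k}{M_k^2} = \mu \tag{46} $$

where $\mu$ is a constant. Using (44) to solve for the constant gives

$$ M_k = M\,\frac{\lambda_k^{1/2}}{\left(\prod_{i=1}^{N}\lambda_i\right)^{1/2N}} \tag{47} $$

and

$$ \mu = \frac{1}{M^2}\left(\prod_{i=1}^{N}\lambda_i\right)^{1/N}. \tag{48} $$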
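However, constraint (i) will only hold if in (47)

$$ \lambda_k \ge \frac{\left(\prod_{i=1}^{N}\lambda_i\right)^{1/N}}{M^2} \tag{49} $$

for every $k$.
The right side of (49) can be written

$$ \frac{\left(\prod_{i=1}^{N}\lambda_i\right)^{1/N}}{M^2} = \frac{2^{\frac{1}{N}\sum_{i=1}^{N}\log_2\lambda_i}}{M^2} \tag{50} $$

$$ \to \frac{2^{\frac{1}{2\pi}\int_{-\pi}^{\pi}\log_2 f(\lambda)\,d\lambda}}{M^2} \tag{51} $$

$$ = \frac{\sigma_m^2}{M^2} \tag{52} $$

where we have used the fact that the eigenvalues of $\Phi$ approach the
ordinates of $f(\lambda)$ equally spaced in $(-\pi, \pi)$ as $N \to \infty$ (see Ref. 6), and
then applied the definition of a Riemann integral. Finally, we used
(19). Hence the constraint (i) is met if

$$ \lambda_k \ge \frac{\sigma_m^2}{M^2} \tag{53} $$

for all $k$. Using (50), (51), and (52), (43) becomes

$$ D_q = 2.72\,\frac{\sigma_m^2}{M^2}. \tag{54} $$

In terms of signal to noise ratio we get

$$ S/N = 10\log_{10}\frac{\sigma^2}{D_q} = 10\log_{10}\frac{\sigma^2M^2}{2.72\,\sigma_m^2}
= 10\log_{10}\frac{\sigma^2}{\sigma_m^2} + 20\log_{10}2\,\log_2 M - 4.34
= 10\log_{10}\frac{\sigma^2}{\sigma_m^2} + 6.02R - 4.34 \tag{55} $$

for

$$ R \ge \tfrac12\log_2\frac{\sigma_m^2}{f(\pi)} \tag{56} $$

and where we used the relation

$$ R = \log_2 M. \tag{57} $$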


Suppose, however, that for some $\lambda_k$'s (53) is not met. Specifically,
arrange the eigenvalues such that $\lambda_1 \ge \lambda_2 \ge \lambda_3 \cdots \ge \lambda_N$ and suppose
that (47) yields

$$ M_k \ge 1, \qquad k = 1, 2, \cdots, J \tag{58a} $$

$$ M_k < 1, \qquad k = J + 1, \cdots, N. \tag{58b} $$

Set those $M_k$ in (58b) equal to one, and reoptimize over the $M_k$ of
(58a), the expression

$$ D_q = \frac{2.72}{N}\left[\sum_{k=1}^{J}\frac{\lambda_k}{M_k^2} + \sum_{k=J+1}^{N}\lambda_k\right] \tag{59} $$

subject to the constraint

$$ \prod_{k=1}^{J} M_k = M^N. \tag{60} $$

We would find that optimally

$$ \frac{\lambda_k}{M_k^2} = \frac{\left(\prod_{i=1}^{J}\lambda_i\right)^{1/J}}{M^{2N/J}} = \mu \tag{61} $$

where the right side of (61) is a constant. Without loss of generality,
we can assume that all $M_k$ obtained from (61) are greater than or equal
to one. Otherwise we would set the infeasible $M_k$ equal to one, and
reoptimize. The procedure would return us to an equation similar to
(61). As $N \to \infty$


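$$ D_q = \frac{2.72}{N}\left(\sum_{i=1}^{J}\frac{\lambda_i}{M_i^2} + \sum_{i=J+1}^{N}\lambda_i\right)
= \frac{2.72}{N}\left(\sum_{i=1}^{J}\mu + \sum_{i=J+1}^{N}\lambda_i\right)
\to 2.72\left[\frac{1}{2\pi}\int_A \mu\,d\lambda + \frac{1}{2\pi}\int_{A'} f(\lambda)\,d\lambda\right] \tag{62} $$

where $A$ and $A'$ are as given in (8) with $\phi$ replaced by $\mu$.
Similarly

$$ \frac{\left(\prod_{i=1}^{J}\lambda_i\right)^{1/J}}{M^{2N/J}} = \mu \tag{63} $$

which, upon rearrangement, becomes

$$ R = \log_2 M = \frac{1}{2N}\sum_{i=1}^{J}\log_2\frac{\lambda_i}{\mu} \to \frac{1}{4\pi}\int_A \log_2\frac{f(\lambda)}{\mu}\,d\lambda. \tag{64} $$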
ll 
By comparing (8a) and (8b) with (62) and (64) we see that (62) has
the optimal spectrum for a rate given by (64). This implies that our
procedure of setting infeasible $M_k$'s equal to one does indeed lead to an
optimum result.
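
The allocation procedure described above (solve for the level counts, set any that fall below one equal to one, and reoptimize over the rest) can be sketched as follows. This is only an illustration under stated assumptions: the level counts are treated as real numbers, the constant 2.72 is applied to every component as in the bound, and the function names are hypothetical.

```python
import math


def allocate_levels(eigs, M):
    """Real-valued level counts M_k with prod(M_k) = M**N and every M_k >= 1."""
    if not eigs or M < 1 or min(eigs) <= 0:
        raise ValueError("need positive eigenvalues and M >= 1")
    N = len(eigs)
    order = sorted(range(N), key=lambda k: eigs[k], reverse=True)
    J = N
    while True:
        active = order[:J]
        # mu = (product of active eigenvalues)^(1/J) / M^(2N/J), computed in logs
        log_mu = sum(math.log(eigs[k]) for k in active) / J \
            - (2.0 * N / J) * math.log(M)
        mu = math.exp(log_mu)
        if eigs[active[-1]] >= mu or J == 1:
            break                      # smallest active component has M_k >= 1
        J -= 1                         # drop it (M_k = 1) and reoptimize
    levels = [1.0] * N
    for k in order[:J]:
        levels[k] = math.sqrt(eigs[k] / mu)
    return levels, mu


def distortion_bound(eigs, levels, k_max=2.72):
    """Bound (k_max / N) * sum(lambda_k / M_k^2) on the mean square error."""
    return k_max / len(eigs) * sum(lam / (m * m) for lam, m in zip(eigs, levels))
```

For example, with eigenvalues `[16, 0.01]` and `M = 2` the second component is assigned a single level and the first receives all four levels, whereas `[4, 1]` keeps both components active.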

Further, the term in brackets in (62) is the minimum mean square
error for a rate given by (64). Hence the quantization procedure has
yielded

$$ D_q \le 2.72\,D_{\min} $$

which is (38).

This result is plotted in dB in Fig. 2, for the case of a Markov-n
process.

There is an approximation involved in obtaining this result. The $M_j$
obtained may not be integers. However, the large $M_j$ will be little
affected by rounding, and the looseness of the bound of (38) for small $M_j$
counteracts the effects of rounding the small $M_j$. In fact, for very small
$M_j$ the bound is conservative, as we can see from Fig. 2. Clearly S/N
should approach zero as $R$ goes to zero. Hence our lower bound on S/N
is loose in this range.
The Optimum Linear Modulator for
a Gaussian Source Used with 
a Gaussian Channel 


By RANDOLPH J. PILC 
(Manuscript received June 12, 1969) 


The optimum linear modulator and demodulator which provide transmission
of a gaussian vector source through an additive gaussian vector channel
are derived in this paper. The measure of performance that is used is
the transmission distortion, which is defined here as the mean square error
between the source output and the decoder output. It is assumed that the
source and channel are mutually independent but that correlations can
exist among the components of each. The performance of the best linear
system is then compared with the distortion shown by Shannon to be
theoretically obtainable when no functional constraint is imposed at the
modulator other than an energy constraint. Although the precise form of
this optimum modulator is not known for general gaussian vector sources
and channels, it is known to be nonlinear and to require arbitrarily long
coding block lengths. However, it is a commonly held notion that when
the source and channel dimensionalities are equal the optimum modulator
is linear and requires a block length of only one. It is shown here that
this belief is incorrect except in very particular situations which are
described. Some relations between the optimum linear modulator-demodulator
pair and Shannon's test channel are discussed, and an example is
included which shows that the nonoptimality of linear devices can be quite
small.


I. INTRODUCTION


We are concerned here with the transmission of a gaussian vector 
source over an additive gaussian vector channel. The mean square 
difference between the source and decoder outputs is used to measure 
the transmission distortion in the system and is, therefore, attempted 
to be minimized in the design of the encoder and decoder. In this 
design the encoder is constrained to present only a limited energy to 


3075 


3076 THE BELL SYSTEM TECHNICAL JOURNAL, NOVEMBER 1969 


the channel, thus constraining the transmission capacity of the system.’ 
It is because the transmission capacity of the system is limited in this 
way that the given gaussian vector source cannot be transmitted with 
arbitrarily small error. 

The distortion which necessarily must exist in the system is pre- 
scribed by Shannon’s rate-distortion theory.” This theory states that 
when the transmission rate in a system is limited to R, the transmission 
of the source must include an average distortion of at least dp , which 
in general is a function of the source statistics and the distortion measure. 
The theory further states that the distortion level dp is attainable 
with some modulator-demodulator pair. Unfortunately, the precise form 
of this modulator and demodulator is not known in general, except 
that it is nonlinear®’* and that it requires the use of arbitrarily long 
coding block lengths.” 

Since the nonlinearity of the optimum encoder is probably a very 
complex twisting of the source space locus within the channel input 
space, the implementation of the optimum encoder, even if it were 
known, would be extraordinarily complex. Of course, the long coding 
block length requirement does nothing to help the situation. For these 
reasons we study in this paper the optimum linear transmission system, 
restricting both the encoder and decoder to be linear operators. Such 
a system uses a block length of only one and is very simple to implement. 
(It is later shown that increasing the block length does not improve 
the performance.) 

The degradation in performance with the use of the optimum linear 
system is found by comparing the resulting distortion to that of the 
optimum nonlinear system as found by Shannon. Contrary to popular 
belief, the best linear system does not provide the minimum attainable 
distortion, even when source and channel dimensionalities are equal, 
except in very particular situations that are described. However, in 
many cases the difference is small. At the end of the paper we discuss 
some relations between the optimum linear modulator-demodulator 
pair and Shannon’s test channel.” 


II. THE LINEAR TRANSMISSION SYSTEM 


The system considered is shown in Fig. 1. The $N_s$-dimensional
zero-mean source vector $w$ is linearly modulated by $A$ to form the
input $x$ to the $N_c$-dimensional additive gaussian noise channel. We
assume the noise vector $n$ to be independent of $w$. The linear demodulator
$B$ extracts from the received vector $y$ an estimate $\hat{w}$ of the source
which is presented to the user. In summary,

$$\hat{w} = By = B(x + n) = B(Aw + n). \tag{1}$$

Fig. 1 - The linear system.


The measure of distortion in the system is taken to be the sum
of mean-square errors between the components of $w$ and $\hat{w}$, that is,

$$d = E\{|w - \hat{w}|^2\} = E\Big\{\sum_{i=1}^{N_s} (w_i - \hat{w}_i)^2\Big\}. \tag{2}$$

The modulation matrices, $A$ and $B$, are sought which minimize this
distortion, their choice subject only to an average channel input energy
constraint,

$$S_T = E\Big\{\sum_{i=1}^{N_c} x_i^2\Big\} = \sum_{i=1}^{N_c} \operatorname{Var} x_i \tag{3}$$

$$\le S_0, \tag{4}$$

which obviously will be met with equality in the optimum system.

It is well known that the minimum mean square error estimate of 
any quantity (here the source vector w) based on the observation 
of a second quantity (here the channel output vector y) is the condi- 
tional expected value of the first given the second.* Further, the average 
error made with such an estimate is the conditional variance of the 
first given the second. Therefore, we have 


$$\hat{w}_i = E(w_i \mid y); \quad i = 1, 2, \ldots, N_s,$$

$$d = \sum_{i=1}^{N_s} \operatorname{Var}(w_i \mid y). \tag{5}$$


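The required conditional density $p(w \mid y)$ can be found from

$$p(w) = k_1 \exp\left[-\tfrac{1}{2} w^t \Phi_w^{-1} w\right]$$

and

$$p(y \mid w) = k_2 \exp\left[-\tfrac{1}{2} (y - Aw)^t \Phi_n^{-1} (y - Aw)\right]$$

by application of Bayes rule. The result is

$$p(w \mid y) = k_3 \exp\left[-\tfrac{1}{2} (w - \hat{w})^t \Phi_{w|y}^{-1} (w - \hat{w})\right]$$

with

$$\Phi_{w|y}^{-1} = A^t \Phi_n^{-1} A + \Phi_w^{-1} \tag{6}$$

and

$$\hat{w}^t \Phi_{w|y}^{-1} = y^t \Phi_n^{-1} A. \tag{7}$$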


From these equations we have one immediate result, that is, the optimum
demodulator matrix is given in terms of $A$ by

$$B = \Phi_{w|y} A^t \Phi_n^{-1}. \tag{8}$$

If we now rewrite equations (5) and (3) as

$$d = \operatorname{trace} \Phi_{w|y}, \tag{9}$$

$$S_T = \operatorname{trace} \Phi_x, \tag{10}$$

we can restate our problem as that of finding the matrix $A$ which
minimizes the trace of $\Phi_{w|y}$, subject to a constrained maximum trace
of $\Phi_x$.
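
As a hedged illustration of equations (6), (8), and (9), the following sketch specializes them to the scalar case $N_s = N_c = 1$ and checks that the mean-square error of the demodulator from (8) equals the conditional variance. The function names and the numerical values are assumptions made for this example only.

```python
# Scalar (N = 1) illustration of equations (6), (8), and (9).
# All names and numeric values here are hypothetical, chosen for the sketch.

def posterior_variance(a, var_w, var_n):
    """Equation (6) for scalars: 1/phi_wy = a^2/var_n + 1/var_w."""
    return 1.0 / (a * a / var_n + 1.0 / var_w)

def optimum_demodulator(a, var_w, var_n):
    """Equation (8) for scalars: b = phi_wy * a / var_n."""
    return posterior_variance(a, var_w, var_n) * a / var_n

def mean_square_error(a, b, var_w, var_n):
    """E(w - b(a w + n))^2 for independent zero-mean w and n."""
    return (1.0 - b * a) ** 2 * var_w + b * b * var_n

a, var_w, var_n = 1.5, 2.0, 0.5
b = optimum_demodulator(a, var_w, var_n)
d = posterior_variance(a, var_w, var_n)       # equation (9) in one dimension
error = mean_square_error(a, b, var_w, var_n)
```

With these values the demodulator gain is 0.6 and both `d` and `error` come out as 0.2 up to rounding; any other gain `b` gives a larger error.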


III. THE SOLUTION UNDER CERTAIN ASSUMPTIONS 


We first restrict our attention to systems in which the source and
channel dimensionalities are equal, $N_s = N_c = N$, and in which the
correlation matrices $\Phi_w$ and $\Phi_n$ are diagonal. From equation (6) we
have

$$\Phi_w \Phi_{w|y}^{-1} = \Phi_w A^t \Phi_n^{-1} A + I \tag{11}$$

and from equation (1) that $\Phi_x = A \Phi_w A^t$ and $\Phi_y = \Phi_x + \Phi_n$, which
provides

$$\Phi_y \Phi_n^{-1} = A \Phi_w A^t \Phi_n^{-1} + I. \tag{12}$$


Noting that $\Phi_y$ enters these equations in a more symmetric way than
does $\Phi_x$, we recast the energy constraint in equation (10) to be in
terms of the received energy at the channel output. This energy equals

$$S_R = E\Big\{\sum_{i=1}^{N_c} y_i^2\Big\} = \sum_{i=1}^{N_c} \operatorname{Var} y_i = \operatorname{trace} \Phi_y = \operatorname{trace} \Phi_x + \operatorname{trace} \Phi_n,$$

which, if trace $\Phi_n = N_0$, is constrained to satisfy

$$S_R \le S_0 + N_0. \tag{13}$$


3.1 The Proof that the Optimum Modulator Matrix is Diagonal


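If we denote the characteristic polynomial of a matrix $M$ in the
variable $\lambda$ by

$$\text{c.p.}[M, \lambda] = \det(M - \lambda I)$$

and state that $M_1$ is square, we can use the following two matrix
properties:°

$$\text{(i)} \quad \text{c.p.}[M_1 M_2, \lambda] = \text{c.p.}[M_2 M_1, \lambda] \tag{14}$$

$$\text{(ii)} \quad \text{c.p.}[M_1, \lambda] = \text{c.p.}[M_1 + I, \lambda + 1] \tag{15}$$

to conclude from equations (11) and (12) that

$$\text{c.p.}[\Phi_w \Phi_{w|y}^{-1}, \lambda] = \text{c.p.}[\Phi_y \Phi_n^{-1}, \lambda]. \tag{16}$$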


It is this equation which provides the important relations among the 
correlation matrices in the system. 
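
Equation (16) can be checked numerically in a small case. The sketch below, which is only an illustration, takes $N = 2$ with an arbitrary nondiagonal modulator and compares the trace and determinant of the two matrix products, since these two numbers fix a 2 by 2 characteristic polynomial. The helper functions and all numerical values are assumptions of this example.

```python
# Numerical check of equation (16) for N = 2 with a nondiagonal modulator.
# A 2x2 characteristic polynomial is fixed by its trace and determinant.
# Helper names and numeric values are hypothetical.

def matmul(p, q):
    return [[sum(p[i][k] * q[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def transpose(p):
    return [[p[j][i] for j in range(2)] for i in range(2)]

def inverse(p):
    det = p[0][0] * p[1][1] - p[0][1] * p[1][0]
    return [[p[1][1] / det, -p[0][1] / det],
            [-p[1][0] / det, p[0][0] / det]]

def add(p, q):
    return [[p[i][j] + q[i][j] for j in range(2)] for i in range(2)]

def trace_det(p):
    return (p[0][0] + p[1][1], p[0][0] * p[1][1] - p[0][1] * p[1][0])

phi_w = [[3.0, 0.0], [0.0, 1.0]]   # assumed source variances
phi_n = [[0.5, 0.0], [0.0, 2.0]]   # assumed noise variances
A = [[1.0, 0.4], [-0.3, 0.8]]      # arbitrary nondiagonal modulator

# Equation (6): inverse of the conditional correlation matrix.
phi_wy_inv = add(matmul(matmul(transpose(A), inverse(phi_n)), A), inverse(phi_w))
# Phi_y = A Phi_w A^t + Phi_n, as used in equation (12).
phi_y = add(matmul(matmul(A, phi_w), transpose(A)), phi_n)

left = trace_det(matmul(phi_w, phi_wy_inv))        # Phi_w Phi_{w|y}^{-1}
right = trace_det(matmul(phi_y, inverse(phi_n)))   # Phi_y Phi_n^{-1}
```

The pairs `left` and `right` agree to rounding error, as (16) requires, even though the two products are different matrices.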

We note that the set of matrix pairs $\Phi_{w|y}$, $\Phi_y$ which are consistent
with equation (16) includes many pairs which do not satisfy both equations
(11) and (12) for any given $A$. The latter equations of course
specify the relations among $\Phi_{w|y}$ and $\Phi_y$ which must exist in the
communication problem under consideration. Nevertheless, we will work
with equation (16) to perform the optimization and then show that
the solutions for $\Phi_{w|y}$ and $\Phi_y$ can be realized with some modulator
matrix $A$ and, therefore, are consistent with the more restrictive equations
(11) and (12).

Equation (14) and the assumed diagonal form of $\Phi_w$ and $\Phi_n$ allow
us to rewrite equation (16) as

$$\text{c.p.}[\Phi_w^{1/2} \Phi_{w|y}^{-1} \Phi_w^{1/2}, \lambda] = \text{c.p.}[\Phi_n^{-1/2} \Phi_y \Phi_n^{-1/2}, \lambda].$$

As $\Phi_w$ and $\Phi_n$ are system constants not under the control of the user,
any specification of $\Phi_y$ completely determines the roots of $\Phi_n^{-1/2} \Phi_y \Phi_n^{-1/2}$,
which we denote by $\{\alpha_i\}$, $i = 1, 2, \ldots, N$. The roots of $\Phi_w^{-1/2} \Phi_{w|y} \Phi_w^{-1/2}$
are also determined and are equal to $\{\alpha_i^{-1}\}$. Our claim now is that
among all matrices $\Phi$ with roots $\{\alpha_i^{-1}\}$, the one which produces the
minimum trace of $\Phi_{w|y} = \Phi_w^{1/2} \Phi \Phi_w^{1/2}$ is diagonal.

If $\varphi_{ij}$ are used to denote the elements of $\Phi$, the trace of $\Phi_{w|y}$ equals

$$\operatorname{trace} \Phi_{w|y} = \sum_{i=1}^{N} \sigma_i^2 \varphi_{ii}.$$

At this point we impose, without loss of generality, that the variances
$\sigma_i^2$ be ordered such that $\sigma_1^2 \ge \sigma_2^2 \ge \cdots \ge \sigma_N^2$. Since the minimum
trace of $\Phi_{w|y}$ is sought, clearly the diagonal elements $\varphi_{ii}$ should
correspondingly satisfy $\varphi_{11} \le \varphi_{22} \le \cdots \le \varphi_{NN}$. This presents no
restriction on $\Phi$ as a simultaneous interchange of rows and columns
produces no change in the characteristic equation of $\Phi$.

Now consider any nondiagonal candidate for the desired $\Phi$. In
particular, let $\varphi_{km} = \varphi_{mk}$, $m > k$, be nonzero. Because the submatrix

$$\Phi(km) = \begin{bmatrix} \varphi_{kk} & \varphi_{km} \\ \varphi_{mk} & \varphi_{mm} \end{bmatrix}$$

is itself a correlation matrix, it can be diagonalized by some orthogonal
matrix $T$ such that

$$\Phi'(km) = T \Phi(km) T^t = \begin{bmatrix} \varphi'_{kk} & 0 \\ 0 & \varphi'_{mm} \end{bmatrix}.$$

From (14) it is known that the characteristic polynomials of $\Phi(km)$
and $\Phi'(km)$ are equal. The trace and determinant of each are therefore
equal. It follows that $\varphi'_{kk} = \varphi_{kk} - c$ and $\varphi'_{mm} = \varphi_{mm} + c$; $c > 0$, or that
the larger diagonal element is increased and that the smaller one is
decreased.

The diagonalization of the submatrix $\Phi(km)$ within $\Phi$ can be effected
by an orthogonal matrix $Q$ which contains $T$ in the appropriate submatrix
position and identity matrix elements in the other positions:

$$q_{ij} = t_{ij}; \quad (i, j) = (k, k), (k, m), (m, k), (m, m)$$

$$q_{ij} = \delta_{ij}; \quad \text{other } (i, j).$$

We then have $\Phi' = Q \Phi Q^t$ with only the elements in $\Phi'$ in rows and
columns $k$ and $m$ changed from those in $\Phi$. If $\Phi'$ is used to generate
a new correlation matrix $\Phi'_{w|y} = \Phi_w^{1/2} \Phi' \Phi_w^{1/2}$, we have

$$\operatorname{tr} \Phi'_{w|y} = \sum_{i=1}^{N} \sigma_i^2 \varphi'_{ii} = \sum_{i=1}^{N} \sigma_i^2 \varphi_{ii} - c(\sigma_k^2 - \sigma_m^2) = \operatorname{tr} \Phi_{w|y} - c(\sigma_k^2 - \sigma_m^2) \le \operatorname{tr} \Phi_{w|y}, \tag{17}$$



which establishes the claim of this section. That is, any nondiagonal
correlation matrix $\Phi$ with roots $\{\alpha_i^{-1}\}$ conjectured as providing a
minimum trace correlation matrix $\Phi_w^{1/2} \Phi \Phi_w^{1/2} = \Phi_{w|y}$ can be improved
upon by $\Phi'$. The desired matrix for $\Phi$ is therefore diagonal and equal to

$$\Phi = [\alpha_i^{-1} \delta_{ij}] \tag{18}$$

with the corresponding form of $\Phi_{w|y}$ equal to

$$\Phi_{w|y} = [\sigma_i^2 \alpha_i^{-1} \delta_{ij}]. \tag{19}$$

It follows that among all matrices $\Phi_{w|y}$ consistent with equation (16)
with any given $\Phi_y$, the one with minimum trace is diagonal.
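
The step behind equation (17) can be illustrated with one 2 by 2 submatrix. The sketch below is a minimal numerical example, not part of the original argument: it diagonalizes an assumed submatrix, assigns the smaller root to position $k$ as the text does, and evaluates the change in the weighted trace. All names and values are hypothetical.

```python
import math

def diagonalize_2x2(pkk, pkm, pmm):
    """Roots of [[pkk, pkm], [pkm, pmm]], smaller one first."""
    mean = 0.5 * (pkk + pmm)
    radius = math.hypot(0.5 * (pkk - pmm), pkm)
    return mean - radius, mean + radius

# Assumed values: sigma_k^2 >= sigma_m^2 and phi_kk <= phi_mm, as in the text.
var_k, var_m = 4.0, 1.0
phi_kk, phi_km, phi_mm = 0.3, 0.2, 0.9

new_kk, new_mm = diagonalize_2x2(phi_kk, phi_km, phi_mm)
c = phi_kk - new_kk                         # shift of the diagonal elements
before = var_k * phi_kk + var_m * phi_mm    # contribution to trace Phi_{w|y}
after = var_k * new_kk + var_m * new_mm     # contribution after the rotation
```

Here `c` is positive, the larger diagonal element grows by the same amount `c`, and `after` equals `before - c * (var_k - var_m)`, which is the inequality in (17) restricted to rows $k$ and $m$.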

An identical argument yields the symmetric conclusion. That is, for
any specified $\Phi_{w|y}$, the matrix $\Phi_y$ with minimum trace among those
consistent with equation (16) is also diagonal and equal to

$$\Phi_y = [\sigma_{ni}^2 \alpha_i \delta_{ij}]. \tag{20}$$

The argument assumes only that the noise variances are ordered
$\sigma_{n1}^2 \le \sigma_{n2}^2 \le \cdots \le \sigma_{nN}^2$.


We can now state that the minimization of the trace of $\Phi_{w|y}$ over
all pairs $\Phi_{w|y}$, $\Phi_y$ which satisfy equation (16) and the constraint equation
(13) is obtained with a pair of diagonal matrices parametrically
related as in equations (19) and (20). Any pair not so related can be
altered, one matrix at a time, to decrease either the error (trace $\Phi_{w|y}$)
or the received energy (trace $\Phi_y$). Although we have worked with
pairs $\Phi_{w|y}$, $\Phi_y$ consistent with equation (16) rather than the smaller
set satisfying equations (11) and (12), the solution forms for $\Phi_{w|y}$ and
$\Phi_y$ are still valid as they do satisfy these equations.

The modulator matrix which produces the correlation matrices $\Phi_{w|y}$
and $\Phi_y$ in the optimum form can be found from either equation (11)
or (12) to be

$$A = \left[\frac{\sigma_{ni}}{\sigma_i} (\alpha_i - 1)^{1/2} \delta_{ij}\right]. \tag{21}$$

Equations (12), (14), and (15) and the fact that $\Phi_n^{-1/2} A \Phi_w A^t \Phi_n^{-1/2}$ has
nonnegative roots (it is a correlation matrix) can be used to show
that $\alpha_i \ge 1$, $i = 1, 2, \ldots, N$, which guarantees that the elements
of $A$ are real. It remains to solve for the set of roots $\{\alpha_i\}$ which provides
the desired optimization.


3.2 The Optimum Diagonal Modulator Matrix

In terms of the set $\{\alpha_i\}$, the distortion which is to be minimized
is given by

$$d = \operatorname{trace} \Phi_{w|y} = \sum_{i=1}^{N} \sigma_i^2 \alpha_i^{-1}$$

and the received energy constraint by

$$S_R = \operatorname{trace} \Phi_y = \sum_{i=1}^{N} \sigma_{ni}^2 \alpha_i \le S_0 + N_0.$$

A further constraint is that $\alpha_i \ge 1$, $i = 1, 2, \ldots, N$. As the set of
permissible $\alpha_i$'s is a convex set and the functions $d(\alpha_i)$ and $S_R(\alpha_i)$ are
convex functions, the Kuhn-Tucker theorem is applicable.° This states
that at the point of minimization:

$$\frac{\partial}{\partial \alpha_i}\left[d + \lambda^2 S_R\right] \begin{cases} = 0 & \text{if } \alpha_i > 1 \\ \ge 0 & \text{if } \alpha_i = 1. \end{cases}$$

Therefore we have

$$-\sigma_i^2 \alpha_i^{-2} + \lambda^2 \sigma_{ni}^2 \begin{cases} = 0 & \text{if } \alpha_i > 1 \\ \ge 0 & \text{if } \alpha_i = 1 \end{cases}$$

or

$$\alpha_i = \max\left(\frac{\sigma_i}{\lambda \sigma_{ni}},\ 1\right). \tag{22}$$

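It has already been observed that $\alpha_1 \ge \alpha_2 \ge \cdots \ge \alpha_N$ and that
$\alpha_i = 1$ corresponds to $a_{ii} = 0$ or no transmission of the $i$th source
component. If we let $N'$ denote the index of the last $\alpha_i$ strictly greater than one,
we have the following solution for the optimum modulator matrix:

$$A = \begin{bmatrix} \left[\dfrac{\sigma_{ni}}{\sigma_i} \left(\dfrac{\sigma_i}{\lambda \sigma_{ni}} - 1\right)^{1/2} \delta_{ij}\right] & 0 \\ 0 & 0 \end{bmatrix}; \quad 1 \le i, j \le N'. \tag{23}$$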


The solution for the distortion in the optimum linear system follows
directly from equation (19):

$$d = \lambda \sum_{i=1}^{N'} \sigma_i \sigma_{ni} + \sum_{i=N'+1}^{N} \sigma_i^2, \tag{24}$$

as does the solution for the total received energy from equation (20):

$$S_R = \frac{1}{\lambda} \sum_{i=1}^{N'} \sigma_i \sigma_{ni} + \sum_{i=N'+1}^{N} \sigma_{ni}^2. \tag{25}$$

In these equations, the parameter $\lambda$ is chosen to satisfy the constraint
in equation (13) with equality. It should be remembered in the solution
for $\lambda$ that $N'$ is a function of $\lambda$, being equal to the largest value
of $i$ for which $\sigma_i/\sigma_{ni} \ge \lambda$. For completeness, we give the optimum
demodulator matrix, which follows from equations (8), (19), and (23):

$$B = \begin{bmatrix} \left[\lambda \left(\dfrac{\sigma_i}{\lambda \sigma_{ni}} - 1\right)^{1/2} \delta_{ij}\right] & 0 \\ 0 & 0 \end{bmatrix}; \quad 1 \le i, j \le N'. \tag{26}$$
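
The solution in equations (22) through (26) can be evaluated numerically once the received energy $S_R$ is given. The following is a minimal sketch under stated assumptions: the source and noise are already diagonal, the components are supplied as standard deviations, and $\lambda$ is found by bisection, which is a choice made for this example rather than a method given in the paper. The function name and the numerical values are hypothetical.

```python
import math

def optimum_linear_system(sigma, sigma_n, received_energy, iterations=200):
    """Evaluate equations (22) through (26) for diagonal source and noise.

    sigma, sigma_n: standard deviations of source and noise components.
    received_energy: S_R, assumed larger than the total noise energy.
    """
    def energy(lam):
        # Received energy from equation (20) with alpha_i from equation (22).
        alphas = [max(s / (lam * sn), 1.0) for s, sn in zip(sigma, sigma_n)]
        return sum(sn * sn * al for sn, al in zip(sigma_n, alphas))

    # The energy decreases as lambda grows, so bisect on a logarithmic scale.
    lo, hi = 1e-12, max(s / sn for s, sn in zip(sigma, sigma_n))
    for _ in range(iterations):
        mid = math.sqrt(lo * hi)
        if energy(mid) > received_energy:
            lo = mid
        else:
            hi = mid
    lam = math.sqrt(lo * hi)
    alphas = [max(s / (lam * sn), 1.0) for s, sn in zip(sigma, sigma_n)]
    a = [sn / s * math.sqrt(al - 1.0) for s, sn, al in zip(sigma, sigma_n, alphas)]
    b = [lam * math.sqrt(al - 1.0) for al in alphas]
    d = sum(s * s / al for s, al in zip(sigma, alphas))
    return lam, alphas, a, b, d

# Example with assumed values: two components, both of which end up in use.
sigma, sigma_n = [2.0, 1.0], [0.5, 1.0]
lam, alphas, a, b, d = optimum_linear_system(sigma, sigma_n, 3.0)
```

For these values both components are transmitted, so (25) reads $S_R = 2/\lambda$, giving $\lambda = 2/3$ and, from (24), $d = 4/3$. Components with $\alpha_i = 1$ receive zero gain in both `a` and `b`, which corresponds to the zero blocks in (23) and (26).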


IV. ELIMINATION OF THE ASSUMPTIONS 


4.1 A Source and Channel with Nonindependent Components 


We now consider systems in which $\Phi_w$ and $\Phi_n$ are not diagonal.
Let $P$ and $R$ be the orthogonal matrices which respectively diagonalize
these two correlation matrices, that is, $\Phi_{w'} = P \Phi_w P^t$ and $\Phi_{n'} = R \Phi_n R^t$
with $\Phi_{w'}$ and $\Phi_{n'}$ diagonal. Using the previous results, we can find the
optimum modulator matrix $A'$ in the primed system containing the
correlation matrices $\Phi_{w'}$ and $\Phi_{n'}$. Now consider the use of the modulator
matrix $A = R^t A' P$ in the system with $\Phi_w$ and $\Phi_n$. From equation (6)
and $\Phi_y = A \Phi_w A^t + \Phi_n$, it can be easily shown that using $A'$ in the
primed system and $A$ in the unprimed system each produces the same
distortion and uses the same energy. Consequently, $A$ must be the
optimum matrix in the unprimed system. If it is not, and $A_1$ is better,
$A'_1 = R A_1 P^t$ would be a better choice than $A'$ for modulator in the
primed system, contrary to $A'$ being optimum.


4.2 Nonequal Source and Channel Dimensionality 


When $N_s \ne N_c$, we can appropriately modify either the source or
channel to restore the equality. For example, when $N_s < N_c$, $N_c - N_s$
source components of arbitrarily small variance, say $\epsilon$, are added to
the original source vector. The optimum modulator is then found as
a function of $\epsilon$ by the previous method, and finally the limit taken
as $\epsilon$ goes to zero. Similarly, when $N_c < N_s$, $N_s - N_c$ channel components
of arbitrarily large noise variance, say $1/\epsilon$, are added to the
original channel, the optimum modulator found, and the limit taken
as $\epsilon$ goes to zero. We have seen that whenever either the source has
a component with small variance or the channel has a component
with large noise variance, the number of source components actually
transmitted, $N'$, is smaller than $N$. Since the optimum modulator
matrix is diagonal, $N'$ is also the number of channel components
actually used. Therefore, the limiting modulator form in both of the
above situations is attained for a nonzero value of $\epsilon$, say $\epsilon_1$. This
modulator form is then optimum for all $\epsilon < \epsilon_1 \ne 0$.


V. COMPARISON OF OPTIMUM LINEAR AND NONLINEAR MODULATORS 


In 1959 C. E. Shannon introduced a relation between $d_R$, the minimum
attainable transmission distortion of a source, and $R$, the total
information rate used in transmission.” This relation involves only the
source statistics and the distortion measure in use. From it one is
able to conclude that any channel with capacity $R$ can be used to
transmit the source with a transmission distortion arbitrarily close
to $d_R$. One need only use a “sufficiently complex” encoder and decoder.

Another part of rate-distortion theory is the idea of a “test channel.”
Associated with each point on the rate-distortion curve, $(d_R, R)$, is
such a test channel which has the significance that among all channels
that transmit the source at a rate equal to $R$, it provides the minimum
transmission distortion $d_R$. Therefore, if there exist pre- and postoperators
which can transform a given capacity $R$ channel into the
test channel for the source at $(d_R, R)$, these operators must be optimum.
An obvious necessary condition for this transformation, which is not
always met, is that the capacity of the test channel at $(d_R, R)$ be equal
to $R$.

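For a gaussian source with variance $\sigma^2$ and squared difference distortion,
Shannon has found’ both the rate distortion expression, $d_R = \sigma^2 e^{-2R}$,
and the test channel:

[Block diagram: $\hat{w}$ and $n$ are summed to give $w$, that is, $w = \hat{w} + n$.] (27)

In this reverse channel, $\hat{w}$ and $n$ are independent gaussian variables with
respective variances $\sigma^2 - d_R$ and $d_R$. It can be shown that this channel
is identical to the forward channel:

[Block diagram: $w$ is multiplied by $A_1$ and $n$ of variance $\sigma_n^2$ is added to give $\hat{w}$, that is, $\hat{w} = A_1 w + n$.] (28)

with $A_1 = (\sigma^2 - d_R)/\sigma^2$, $\sigma_n^2 = A_1 d_R$, and the independence between
$w$ and $n$. A similar form is given by Gallager in Ref. 7. Still another
form of the test channel is:

[Block diagram: $w$ is multiplied by $A$, $n$ is added, and the sum is multiplied by $B$ to give $\hat{w}$, that is, $\hat{w} = B(Aw + n)$.] (29)

with $A^2 = (\sigma^2 - d_R)\sigma_n^2/\sigma^2 d_R$, $B^2 = (\sigma^2 - d_R) d_R/\sigma^2 \sigma_n^2$, and $n$ any
given additive gaussian noise.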

Now consider a single dimensional gaussian channel of capacity $R$.
Since the received energy $S_R$ is accordingly restricted to $\sigma_n^2 \exp(2R)$,
we have from equations (23) through (26) that the optimum linear
operators are

$$a_{11}^2 = \frac{\sigma_n^2}{\sigma^2} (\alpha - 1) = \frac{\sigma_n^2}{\sigma^2} \left(\frac{\sigma^2 - d}{d}\right),$$

$$b_{11}^2 = \lambda^2 (\alpha - 1) = \frac{d(\sigma^2 - d)}{\sigma^2 \sigma_n^2},$$

$$d = \frac{\sigma^2 \sigma_n^2}{S_R}.$$

Note that the distortion $d$ equals $\sigma^2 \exp(-2R)$, and that $a_{11}$ and $b_{11}$
agree precisely with the test channel parameters in (29). Therefore,
we can conclude that in this case the operators in equations (23) and
(26) are optimum, even outside the linear class.

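The rate-distortion curve and the test channel for gaussian vector
sources can also be found from Shannon’s results. The results for the
$N$-dimensional source with variances $\sigma_1^2, \sigma_2^2, \ldots, \sigma_N^2$ are (we continue
to assume that $\sigma_1^2 \ge \sigma_2^2 \ge \cdots \ge \sigma_N^2$):

$$d_R = \begin{cases}
N e^{-2R/N} \left[\prod_{i=1}^{N} \sigma_i^2\right]^{1/N}; & 0 \le d_R \le N \sigma_N^2 \\[2mm]
\sigma_N^2 + (N-1) e^{-2R/(N-1)} \left[\prod_{i=1}^{N-1} \sigma_i^2\right]^{1/(N-1)}; & N \sigma_N^2 \le d_R \le \sigma_N^2 + (N-1) \sigma_{N-1}^2 \\[2mm]
\sigma_N^2 + \sigma_{N-1}^2 + (N-2) e^{-2R/(N-2)} \left[\prod_{i=1}^{N-2} \sigma_i^2\right]^{1/(N-2)}; & \sigma_N^2 + (N-1) \sigma_{N-1}^2 \le d_R \le \sigma_N^2 + \sigma_{N-1}^2 + (N-2) \sigma_{N-2}^2 \\[2mm]
\qquad \vdots \\[2mm]
\sigma_N^2 + \cdots + \sigma_2^2 + \sigma_1^2 e^{-2R}; & \sigma_N^2 + \cdots + 2\sigma_2^2 \le d_R \le \sigma_N^2 + \cdots + \sigma_1^2 \\[2mm]
\sigma_N^2 + \cdots + \sigma_1^2; & R = 0.
\end{cases} \tag{30}$$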
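
The piecewise form of equation (30) can be evaluated by trying the branches in order, starting with all $N$ components active. The following is a small sketch of that evaluation, assuming the variances are supplied in decreasing order and the rate is in nats; the function name is hypothetical and the branch-selection loop is this example's own device.

```python
import math

def gaussian_vector_distortion(variances, rate):
    """Evaluate d_R from equation (30); variances sorted in decreasing order."""
    n = len(variances)
    for k in range(n, 0, -1):
        # Branch of (30) in which the k largest components are active.
        log_product = sum(math.log(v) for v in variances[:k])
        level = math.exp((log_product - 2.0 * rate) / k)
        # The branch applies when the common distortion level does not
        # exceed the smallest active variance.
        if level <= variances[k - 1]:
            return k * level + sum(variances[k:])
    return sum(variances)

# Assumed examples: a uniform source, and a skewed source at a low rate.
uniform = gaussian_vector_distortion([1.0, 1.0], 1.2)
skewed = gaussian_vector_distortion([4.0, 1.0], 0.1)
```

For the uniform two-component source the result is $2e^{-R}$, and for the skewed source at this low rate only the larger component is active, giving $1 + 4e^{-0.2}$.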


This expression can also be applied to a gaussian vector source with
correlated components if the variances $\sigma_i^2$ are interpreted as those in
the diagonalized correlation matrix $\Phi_{w'} = P \Phi_w P^t$. The test channel
for $N > 1$ is the product of elementary test channels given in (29) with

$$A = A_1, A_2, \ldots, A_N,$$

$$A_i^2 = \frac{(\sigma_i^2 - d_i)\, \sigma_{ni}^2}{\sigma_i^2 d_i},$$

$$d_i = \min(\sigma_i^2, d_R),$$

$$\sigma_n = \sigma_{n1}, \sigma_{n2}, \ldots, \sigma_{nN} = \text{any noise vector}.$$

Let us now presume that the vector channel provided for use has
the additive noise variances given by the vector $\sigma_n^2$ and is constrained
to have an output energy level equal to $S_R$. This equivalently specifies
the channel capacity as

$$R = \sum_{i=1}^{N} \frac{1}{2} \log \frac{S_{Ri}}{\sigma_{ni}^2} \tag{31}$$

with

$$S_{Ri} = \max(S, \sigma_{ni}^2)$$

and $S$ adjusted to have $\sum_i S_{Ri} = S_R$. The comparison between the
minimum attainable transmission distortion using linear transmitter
and receiver operators (equation 24) and using unrestricted transmitter
and receiver operators (equation 30) now reveals that, contrary to the
single dimension case, when $N > 1$ the linear operators are not, in
general, optimum. The only exception is when both the vectors $\sigma^2$
and $\sigma_n^2$ are uniform. Some intuition as to why the single and multidimensional
cases are different might be provided by the following.
The test channel at $(d_R, R)$, for example, the one including the
noise vector $\sigma_n^2$ in its form, is a result of a minimization of mutual
information under a distortion constraint. It does not, therefore,
necessarily divide the total energy presented to the gaussian vector
channel in a way which uses this channel to capacity. Since this channel,
by definition, transmits information at a rate equal to $R$, its total
capacity is (except for the special case noted previously) strictly
greater than $R$. Consequently, when the same additive noise channel,
$\sigma_n^2$, is to be used for transmission but is stated to have a capacity
of only $R$, it cannot be transformed into the test channel by any pre-
and postoperators.

The impossibility of such a transformation can also be observed
by noting that the allowed total input energy on the given capacity
$R$ channel is restricted to a lower level than present on the test channel.
The uniqueness of the test channel, which is formed with linear operators,
and the continuity of both the mutual information and distortion
with the modulator matrix then preclude the possibility of
attaining the test channel’s performance with the given capacity $R$
channel and linear operators.

One could argue that the comparison to this point is not fair in 
that Shannon allows modulators and demodulators that operate on 
blocks of letters, whereas the results in equations (23), (24), and (25) 
were derived using a coding block length of one. However, the previous 
results show that the optimum linear modulator does not mix indepen- 
dent source components before presentation to the channel, assuming 
the channel has already been rotated in N-space so as to have indepen- 
dent noise components. Neither does it cross-couple sets of source 
components having no cross dependence when presentation is to a 
channel with sets of noise components of equal respective dimen- 
sionalities also having no cross dependence. Therefore, if successive 
source and channel (vector) events are independent, and their dimen- 
sionalities filled out to be equal by adding either zero variance source 
components or infinite variance noise components, there is no memory 
introduced by the optimum linear modulator among elements of the 
encoded block. The consequence is that the distortion and the energy 
are only scaled by the block length in use. 


VI. AN EXAMPLE 


We cite here just one example which shows that at least in many
cases the performance of the optimum linear modulator-demodulator
pair compares favorably with that theoretically obtainable with more
complex operators. We take $\sigma_1 = \sigma_2 = 1$, $\sigma_{n1} = a$, $\sigma_{n2} = a e^{2\varphi}$ and use
$a$ and $\varphi$ as parameters that generate a set of different channels. To
better compare the two performances, we fix the channel capacity at
$C$ which in turn fixes Shannon’s minimum attainable distortion at
$d_C = 2e^{-C}$. The total allowed received energy is thus specified according
to equation (31).

Upon solution for $\lambda$ and $d$ in equations (24) and (25) we have the
following expression for the ratio between the distortion obtainable
with linear operators and that theoretically attainable:

$$\frac{d(\varphi)}{d_C} = \begin{cases}
\cosh^2 \varphi; & 0 \le \varphi \le C/2 \\[2mm]
\dfrac{\cosh^2 \varphi}{\cosh(2\varphi - C)}; & C/2 \le \varphi \le C \\[2mm]
\cosh C; & \varphi \ge C.
\end{cases}$$
We illustrate this function for several different values of capacity in
Fig. 2. At $\varphi = 0$ (where both the vectors $\sigma^2$ and $\sigma_n^2$ are uniform) we
see that $d(0) = d_C$, indicating the optimality of the linear modulator
and demodulator for this case. Using a term introduced in Ref. 8,
we can therefore say that when $\varphi = 0$ the source and channel are
“matched.” As $\varphi$ increases, the source-channel mismatch increases and
the nonoptimality of linear operators also increases. As the figure
illustrates, the nonoptimality ratio, $d(\varphi)/d_C$, can be quite large when
both the channel capacity is high and the additive noise vector is
highly skewed in variance. However, over a significant region of interest,
$\varphi \le 1$ (reflecting a noise component variance ratio of about 50), the
nonoptimality ratio is small.

Fig. 2 - The linear system nonoptimality for $N = 2$, $\sigma_1 = \sigma_2 = 1$, $\sigma_{n1} = 1$, $\sigma_{n2} = \exp 2\varphi$.
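
The example can be reproduced numerically from the earlier equations. The sketch below is a hedged illustration: it fixes the received energy from the capacity through equation (31), solves equations (22), (24), and (25) for the linear system by bisection (a choice made for this example), and divides by $2e^{-C}$. The closed-form ratio used for comparison is a three-branch hyperbolic-cosine expression in $\varphi$ and $C$. All function names are hypothetical.

```python
import math

def received_energy_for_capacity(noise_vars, capacity):
    """Equation (31): total S_R for which the channel capacity equals `capacity`."""
    def rate(level):
        return sum(0.5 * math.log(max(level, v) / v) for v in noise_vars)
    lo, hi = min(noise_vars), max(noise_vars) * math.exp(2.0 * capacity)
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if rate(mid) < capacity:
            lo = mid
        else:
            hi = mid
    level = 0.5 * (lo + hi)
    return sum(max(level, v) for v in noise_vars)

def linear_distortion(sigma, sigma_n, energy):
    """Equations (22), (24), (25): distortion of the optimum linear system."""
    def used(lam):
        return sum(sn * sn * max(s / (lam * sn), 1.0)
                   for s, sn in zip(sigma, sigma_n))
    lo, hi = 1e-12, max(s / sn for s, sn in zip(sigma, sigma_n))
    for _ in range(200):
        mid = math.sqrt(lo * hi)
        if used(mid) > energy:
            lo = mid
        else:
            hi = mid
    lam = math.sqrt(lo * hi)
    return sum(s * s / max(s / (lam * sn), 1.0) for s, sn in zip(sigma, sigma_n))

def nonoptimality_ratio(phi, capacity, a=1.0):
    """d(phi)/d_C for sigma_1 = sigma_2 = 1, sigma_n1 = a, sigma_n2 = a exp(2 phi)."""
    sigma = [1.0, 1.0]
    sigma_n = [a, a * math.exp(2.0 * phi)]
    energy = received_energy_for_capacity([sn * sn for sn in sigma_n], capacity)
    return linear_distortion(sigma, sigma_n, energy) / (2.0 * math.exp(-capacity))

def closed_form(phi, capacity):
    """Three-branch closed form of the ratio, used only for comparison."""
    if phi <= capacity / 2.0:
        return math.cosh(phi) ** 2
    if phi <= capacity:
        return math.cosh(phi) ** 2 / math.cosh(2.0 * phi - capacity)
    return math.cosh(capacity)
```

For instance, `nonoptimality_ratio(0.0, 2.0)` returns 1 to rounding error, the matched case, while `nonoptimality_ratio(3.0, 2.0)` approaches $\cosh 2 \approx 3.76$. The ratio does not depend on the scale parameter `a`.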


VII. SUMMARY 


In this paper we have derived the optimum linear modulator and 
demodulator for the transmission of a gaussian vector source through 
an additive gaussian vector channel. It was found that when both 
the source and channel components are independent, both the modulator 
and demodulator matrices are diagonal. This specifies the separate 
amplification, transmission, and decoding of each source component. 
When both the source and channel components are correlated, the
optimum modulator matrix was found to be the cascade of three
matrices: (i) the orthogonal matrix which diagonalizes the source
correlation matrix, (ii) the optimum modulator matrix which transmits
this newly formed independent component source over the independent
component additive noise channel which is formed by (iii) the orthogonal
transformation matrix that diagonalizes the noise correlation
matrix. We have found that in general the best linear system does not
provide a distortion as small as that stated by Shannon to be attainable 
with a channel of the same capacity. The only exception is when both 
the source and channel noise variance vectors are uniform. The non- 
optimality of linear modulators and demodulators can be quite large 
in some cases but, in many other situations, can be small enough to 
justify the use of these very simple operators.


Communication Systems Which Minimize
Coding Noise


By BROCKWAY McMILLAN
(Manuscript received May 28, 1969)


The problem of minimizing coding or quantizing noise in a communication
system is posed in a general setting. It is shown that if the messages to
be transmitted are sample sequences drawn from a discrete-time random
process meeting a certain simply stated criterion of “randomness” and if
there exists a quantized communication system which is optimal in that it
introduces a minimum amount of coding noise, then this optimal system
can be realized using a transmitter of special form. Specifically, the optimum
transmitter is one which quantizes each message sample according to a
scheme that depends only upon the quantized material already transmitted,
rather than upon the (unquantized) material that has been previously offered
for transmission. It follows that only digital storage is required at the
transmitter or receiver. If the receiver is limited, a priori, to have only a given
finite amount of storage, and if the system is optimum within this constraint,
the transmitter need have only the same amount of storage.


I. INTRODUCTION: THE MODEL 


Shannon’s theory of communication shows how to defeat noise introduced
in a communication medium by restricting the repertoire of transmitted
signals to a discrete set.’ If the messages to be transmitted are
not already in an appropriately discrete form, noise in the medium is 
then eliminated only at the expense of noise, here called coding noise, 
caused by the failure of the restricted family of available signals to 
represent faithfully the full family of possible messages. The amount of 
coding noise introduced is of course subject to control by design. 

This paper considers one aspect of the problem of minimizing coding 
noise. Noise in the medium is not considered. The paper limits attention 
to systems in which the random process representing the message is a 
discrete-time or sampled-data process. The sampling noise caused by
creating such a process out of a continuous-time process is not considered.
The problem of selecting a coding scheme that maximizes the rate of
communication over a noisy channel is not considered. Rather, the paper
starts at the point that a coding scheme has been found that is optimum
according to a fairly general criterion of fidelity. What is then shown is
that the transmitter and receiver (encoder and decoder) of the system
are of a special form.

A Q-coded communication system is defined by a discrete set $Q$ and by
three jointly distributed random processes, $\{x_n, q_n, y_n \mid n = 0, \pm 1, \pm 2, \ldots\}$.
For purposes of this paper, the set $Q$ will be either


(i) the set $\{1, 2, \ldots, M\}$, where $M$ is a given positive integer > 1, or
(ii) the set $\{1, 2, 3, \ldots\}$ of all positive integers.


The process $\{x_n\}$ represents periodic samples derived from the message
offered for transmission; each $x_n$ is a real random variable. $\{q_n\}$ represents
the transmitted signals; for each $n$, $q_n$ is a random variable, taking values
from the set $Q$ and measurable on the sample space of $\{x_n, x_{n-1}, x_{n-2}, \ldots\}$.
That is, for each $n$, the value of the integer variable $q_n$
depends only upon, and is determined (apart perhaps from events of
probability zero) by the present and past of the message. $\{y_n\}$ represents
the version of the message reconstructed at the receiver; for each $n$, $y_n$
is a real random variable measurable on the sample space of $\{q_n, q_{n-1}, q_{n-2}, \ldots\}$.
Therefore for each $n$, $y_n$ depends only upon, and is determined
(apart perhaps from events of probability zero) by the present and
past of the transmitted signal.

The model at this point is very general. It provides that at each time
$n$, a discrete valued random variable $q_n$ be generated in some way out of
the material $\{x_n, x_{n-1}, x_{n-2}, \ldots\}$ then available from the message
process, and that subsequently at the receiver a $y_n$ be generated out of the
material $\{q_n, q_{n-1}, \ldots\}$ there currently available. If all three processes
$\{x_n, q_n, y_n\}$ are stationary we can call the system stationary. The question
of stationarity does not enter in what follows.

What remains to be specified in this model is that in some sense the
process $\{y_n\}$ is to represent the process $\{x_n\}$. At the start it appears
natural to consider three cases; it develops that two are simply special
cases of the third, one of them not interesting in the framework of this
paper.

We start with a given sequence $\{\psi_n \mid n = 0, \pm 1, \pm 2, \ldots\}$ of functions,
in which each $\psi_n$ is a real valued Borel measurable function $\psi_n(x, y)$ of
the real variables $x$, $y$. The use of a sequence $\{\psi_n\}$ here is a largely decorative
generality that costs nothing. The conventional case is that in
which all $\psi_n$ are the same function $\psi$. These functions define a fidelity
criterion as follows:




Case (i), the delay-free case:


Here we choose to regard $y_n$ as a replica of $x_n$, and evaluate our
communication system at each time $n$ by the quantity

$$E\{\psi_n(x_n, y_n)\}, \tag{1}$$

where $E$ denotes expectation over the message ensemble.


Case (ii), the case of fixed delay:


Here we are given a fixed integer $d \ge 0$ and we choose to regard $y_n$
as a replica of $x_{n-d}$, thus allowing $q_n$ to take advantage not only of
$\{x_{n-d}, x_{n-d-1}, \ldots\}$ (the present and past of $x_{n-d}$) but also of
$\{x_n, x_{n-1}, \ldots, x_{n-d+1}\}$ (a limited span of the “future” of $x_{n-d}$) in representing $x_{n-d}$.
Here the criterion relative to $x_{n-d}$ is (by a convention we will use with
respect to indices)

$$E\{\psi_n(x_{n-d}, y_n)\}. \tag{2}$$

If $d = 0$, this case reduces to case (i).


Case (iii), block encoding with cycle time c:


This is the situation that arises naturally in Shannon’s theory. We
are given a fixed integer $c \ge 1$, and the transmission process is repetitive
with a cycle of length $c$. By a choice of time origin, we can describe it as
follows. Let $Q_1$ be a discrete set with $M_1 < \infty$ members. At time 0 the
transmitter examines $\{x_0, x_{-1}, \ldots\}$ and generates a $Q_1$-discrete variable
which we shall call $\bar{q}_0$. At time $c$, the transmitter then examines
$\{x_c, x_{c-1}, \ldots\}$ and produces $\bar{q}_1$; the process repeats with period $c$. For
transmission, the random variable $\bar{q}_0$ is encoded into the string $\{q_c, q_{c-1}, \ldots, q_1\}$
of random variables each being Q-discrete, where $M^c = M_1$. At
time $c$, all of $\bar{q}_0, \bar{q}_{-1}, \ldots$ are available at the receiver, being represented
by the sequence $\{q_c, q_{c-1}, q_{c-2}, \ldots\}$. From these, the sequence
$\{y_{2c-1}, y_{2c-2}, \ldots, y_c\}$ is generated, representing $x_0, x_{-1}, \ldots, x_{-c+1}$,
respectively. We think of these $y$’s as being presented to the output
of the receiver in the order of their indices, $y_c$ at time $c$, and so on.

If one follows through the functional dependencies here, he sees that
indeed the processes $\{x_n, q_n, y_n\}$ are so related that each $q_n$ depends at
most upon $\{x_n, x_{n-1}, \ldots\}$, and each $y_n$ at most upon $\{q_n, q_{n-1}, \ldots\}$.
Indeed, except at times which recur with period $c$, $q_n$ is not “up to
date,” depending in fact only on $x$’s strictly prior to $x_n$. Similarly, $y_n$ is
only periodically up to date; at other times it depends only upon $q$’s
that are actually earlier than $q_n$.

In the situation as just described, the criterion of fidelity becomes
$E\{\psi_n(x_{n-2c+1}, y_n)\}$. Case (iii) is then also a special case of case (ii), in which
$d = 2c - 1 \ge 1$. What makes it special is that in case (ii), $q_n$ and $y_n$
are permitted to be up to date at each value of $n$; however, in case (iii)
the block coding process restricts the currency of the data upon which
most of the $q$’s and $y$’s depend.

Actually, case (iii) as just described will turn out not to be covered,
in general, by the theorems to be proved. This happens because, as will
later be stated more precisely, we are interested only in communication
systems that minimize (2) for each $n$, in comparison with all possible
competing systems. Clearly, to impose the restrictions immanent in
case (iii) upon one’s repertoire of coding schemes limits the domain
within which a minimum is to be sought. The system that brings
about an absolute minimum is simply not, in general, to be found
in this restricted domain.

The previous observation is not to be entered as a criticism of Shannon’s
theory. Typically, in a noisy medium, it is necessary to use a
highly redundant encoding $\{q_c, q_{c-1}, \ldots, q_1\}$ to represent $\bar{q}_0$, so that
the inefficiencies (as measured by expression 2) that are imposed by
the block-coding process are needed in order to ensure that the $y_n$
in (2) is an approximately error free replica of $x_{n-d}$. We must remember
that (2) measures the noise introduced by the coding process, not by
the noisy medium. It is interesting to a designer only if the latter
noise has been eliminated. The price of this elimination is that one
may not be able to minimize (2) in competition with systems that
are not restricted to be of block coding form.

A true engineering solution to the problems reflected in the remarks 
immediately above would consider (2) in which the expectation is 
taken over the joint ensemble of message and noise. The solution 
should balance coding noise against channel noise at, say, a fixed delay, 
to minimize (2). This paper is very far from solving such a problem. 

It does not follow that the results of this paper are without interest
in the search for coding schemes to eliminate noise. Given a Q-coded
communication system which does minimize (2), the $\{q_n\}$ process is
in digital form. This $\{q_n\}$ process can then be redundantly encoded
according to Shannon’s theory, and recovered with few errors (and
typically much delay) at the receiver. The $\{y_n\}$ process then results
(perhaps delayed) and has few errors. Then (2) does measure the
total amount of noise introduced in this operation.


II. STATEMENT OF RESULTS 


Given the message process $\{x_n\}$, the sequence $\{\psi_n\}$, and the delay 
$d \ge 0$, a Q-coded communication system $\{x_n, q_n, y_n\}$ will be called 
$\{\psi_n, d\}$-optimal if 

(i) For each $n = 0, \pm 1, \pm 2, \ldots$, 

$$E\{|\psi_n(x_{n-d}, y_n)|\} < \infty, \tag{3}$$

and 

(ii) For any other Q-coded communication system $\{x_n, q'_n, y'_n\}$, 

$$E\{\psi_n(x_{n-d}, y_n)\} \le E\{\psi_n(x_{n-d}, y'_n)\}, \tag{4}$$

for each $n = 0, \pm 1, \pm 2, \ldots$. 

The simplest result of this paper is of such a form as to illustrate 
the nature of all of the results. We define a class K of functions $\psi$, 
and a class, here called CCD, of message processes $\{x_n\}$, such that 
the following theorem is true. 


Theorem 1: Let $\{x_n, q_n, y_n\}$ be a given Q-coded communication system 
that is $\{\psi_n, 0\}$-optimal. If each $\psi_n \in K$, $n = 0, \pm 1, \pm 2, \ldots$, and if 
$\{x_n\} \in$ CCD, then each $q_n$ is equal with probability one to a random variable 
measurable on the sample space of $\{x_n, q_{n-1}, q_{n-2}, \ldots\}$. 


The force of this theorem is that it simplifies, in principle at least, 
the requirements for memory at the transmitter. Only the digital 
sequence $\{q_{n-1}, q_{n-2}, \ldots\}$ need be in storage at time n. The proof 
of the theorem will also develop a standard structure for the optimum 
transmitter difficult to summarize easily in a theorem. 

The definition of the class K is long and is deferred to Section III. 
Suffice it here to say that K is a large class that includes the conventional 

$$\psi^1(x, y) = |x - y|, \qquad \psi^2(x, y) = (x - y)^2$$

and any other continuous strictly increasing function of $\psi^1$. 
We define CCD, and a related class CCDf, thus: 

CCD consists of those processes $\{x_n\}$ such that: for each $n = 0, \pm 1, \pm 2, \ldots$, 
if $z$ is a random variable measurable on the sample 
space of $\{x_{n-1}, x_{n-2}, \ldots\}$, then the probability that $z = x_n$ is zero: 

$$P\{z = x_n\} = 0. \tag{5}$$

CCDf consists of those processes $\{x_n\}$ such that: for each $n = 0, \pm 1, \pm 2, \ldots$, 
if A is a finite Borel field or the completion of a finite 
Borel field, and if $z$ is a random variable measurable on the smallest 
Borel field containing A and the sample space of $\{x_{n-1}, x_{n-2}, \ldots\}$, 
then (5) holds. 

Read CCD as “continuous conditional distribution.” If $\{x_n\} \in$ CCD 
and if $x_n$ has a conditional distribution given $\{x_{n-1}, x_{n-2}, \ldots\}$, that 
distribution must be continuous. 

We now define a more restricted class of Q-coded communication 
systems and a corresponding notion of optimality. 

Given an integer $m \ge 0$, a Q-coded communication system $\{x_n, q_n, y_n\}$ 
will be said to have decoder memory span m if for each $n = 0, \pm 1, \pm 2, \ldots$, 
$y_n$ is measurable on the sample space of $\{q_n, q_{n-1}, \ldots, q_{n-m}\}$. 

A Q-coded communication system $\{x_n, q_n, y_n\}$ will be called 
$\{\psi_n, d, m\}$-optimal if it has decoder memory span m, if (3) holds for 
every n, and if (4) holds for every n and for every $\{x_n, q'_n, y'_n\}$ which 
has decoder memory span m. 

In the case of $\{\psi_n, d, m\}$-optimality, then, the competition is 
restricted to systems with decoder memory span m. We can put $m = \infty$ 
to refer to the case of $\{\psi_n, d\}$-optimality defined earlier. 

Perhaps our most surprising result is that case ii of our model, 
which includes case i as a special case, is also included in case i. This 
is shown by Theorem 2. 


Theorem 2: Let $\{x_n, q_n, y_n\}$ be a given Q-coded communication system 
that is $\{\psi_n, d\}$-optimal. If each $\psi_n \in K$, $n = 0, \pm 1, \pm 2, \ldots$, if M, 
the number of elements of Q, is finite, and if $\{x_n\} \in$ CCDf, then each $q_n$ is 
equal with probability one to a random variable measurable on the sample 
space of $\{x_{n-d}, q_{n-1}, q_{n-2}, \ldots\}$. Furthermore, the system $\{x_n, q'_n, y'_n\}$, 
where 

$$q'_n = q_{n+d}, \qquad y'_n = y_{n+d}, \tag{6}$$

is a Q-coded communication system that is $\{\psi'_n, 0\}$-optimal, where 

$$\psi'_n = \psi_{n+d}, \qquad n = 0, \pm 1, \pm 2, \ldots. \tag{7}$$


Finally, we state a theorem that includes the two preceding ones. 


Theorem 3: Let $\{x_n, q_n, y_n\}$ be a given Q-coded communication system 
that is $\{\psi_n, d, m\}$-optimal. If each $\psi_n \in K$, $n = 0, \pm 1, \pm 2, \ldots$, if 
$M < \infty$, and if $\{x_n\} \in$ CCDf, then each $q_n$ is equal with probability one 
to a random variable measurable on the sample space of 
$\{x_{n-d}, q_{n-1}, \ldots, q_{n-m}\}$ ($\{x_{n-d}\}$ if m = 0). The system as defined by (6) is a Q-coded 
communication system with decoder memory span m that is $\{\psi'_n, 0, m\}$-optimal, 
where $\psi'_n$ is given by (7). If, in the initial hypotheses, d = 0, 
then it suffices that $\{x_n\} \in$ CCD and the restriction $M < \infty$ may be removed. 
If $m < \infty$, the hypothesis $\{x_n\} \in$ CCDf may be replaced by: 
For each $n = 0, \pm 1, \pm 2, \ldots$, if $z$ is a random variable that takes 
only finitely many values, then $P\{x_n = z\} = 0$. 


Theorem 1 shows the basic facts about measurability in the present 
context. Theorem 2 adds the fact that delay d > 0 gains no advantage 
(since the “future” of $x_{n-d}$ is not known at the receiver, even if it is 
at the transmitter). Finally, Theorem 3 includes these facts and shows 
that a limitation on the memory span of the receiver allows a cor- 
responding simplification of the transmitter. 

In the proofs of these theorems it is seen that they are true for classes 
of processes slightly larger than CCD or CCDf. In particular, the final 
conclusion of Theorem 3 opens the case of finite memory span to any 
process $\{x_n\}$ that has a little additive nonsingular Gaussian noise in 
each sample. 


III. THE CLASS K 


The class K of cost functions allowed by these theorems can be 
very general. The definition below seems more inclusive than is called 
for by the applications I can think of; at the cost of elaboration, it 
can be enlarged further. 

We let K be the class of all functions $\psi(x, y)$ of two real variables 
x, y with the following properties. 

(i) $\psi(x, y)$ is continuous; 

(ii) for all x, y, $\psi(x, y) \ge 0$; 

(iii) for all x, $\psi(x, x) = 0$; 

(iv) for each y, there are at most countably many solutions x to the 
equation 

$$\psi(x, y) = 0, \tag{8}$$

in the sense that: there exist Borel measurable functions $g_k(y)$, 
$k = 1, 2, 3, \ldots$, such that if (8) holds, then for some k, $x = g_k(y)$. 

(v) If $y_1 \ne y_2$, there are at most countably many solutions to the 
equation 

$$\psi(x, y_1) = \psi(x, y_2), \tag{9}$$

in the sense that: there exist Borel measurable functions $f_k(y_1, y_2)$, 
$k = 1, 2, 3, \ldots$, such that if (9) holds and if $y_1 \ne y_2$, then for some k, 
$x = f_k(y_1, y_2)$. 

It follows from this definition that $\psi^1 \in K$, where $\psi^1(x, y) = |x - y|$. 
Then also $\psi^2 \in K$, where $\psi^2(x, y) = (x - y)^2$. Similarly any other 
continuous strictly monotone function of $\psi^1$ is also in K. In all of these 
instances, (8) has the unique solution $y = x$, and (9) has a unique 
solution given by $2x = y_1 + y_2$. 
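As a small numerical illustration of properties (iii) and (v) for the two conventional cost functions, the following sketch checks that the midpoint of two distinct reconstruction values is a tie point of equation (9). The function names `psi_abs`, `psi_sq`, and `tie_point` are hypothetical and introduced only for this example; `tie_point` plays the role of a single function $f_1(y_1, y_2)$.

```python
def psi_abs(x, y):
    """Absolute-error cost |x - y|."""
    return abs(x - y)


def psi_sq(x, y):
    """Squared-error cost (x - y)^2."""
    return (x - y) ** 2


def tie_point(y1, y2):
    """Candidate f_1(y1, y2): the midpoint, where both costs tie."""
    return (y1 + y2) / 2


# Inputs are chosen so that all arithmetic is exact in binary floating point.
for psi in (psi_abs, psi_sq):
    for y1, y2 in [(0.0, 1.0), (-3.0, 5.0), (2.5, 2.75)]:
        x = tie_point(y1, y2)
        assert psi(x, y1) == psi(x, y2)          # equation (9) holds at the midpoint
        assert psi(y1, y1) == 0                  # property (iii)
        assert psi(x + 0.5, y1) != psi(x + 0.5, y2)  # no tie away from the midpoint
```

The check is only a spot test on a few values; it does not replace the argument that the midpoint is the unique solution of (9) for these functions.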


IV. PROOFS 


Let $\{\Omega, B, P\}$ be a probability space: a set $\Omega$ of points $\omega$, a Borel 
field B of subsets of $\Omega$, and a probability measure P on B with respect 
to which B is complete. This probability space is assumed given and 
fixed. 

A random variable x is a real-valued function $x(\omega)$ defined on $\Omega$ 
and measurable B. 

If $F \subset B$ is a Borel field, a random variable x is said to be essentially 
measurable F if x is equal with probability one to a random variable $x'$ 
which is measurable F. If F is complete, such an x is then itself 
measurable F. 

If $F \subset B$ is a Borel field and x a random variable, $\{x\} \vee F$ denotes 
the smallest Borel field such that: x is measurable $\{x\} \vee F$ and 
$F \subset \{x\} \vee F$. 

A random variable taking its values in the set Q will be called Q- 
discrete. 

Denote by $[x \mid q, F \mid y, G]$ a mathematical object of the following 
kind: 

x is a random variable, 

q is a Q-discrete random variable, 

F is a Borel field, $F \subset B$, and q is essentially measurable on the field 
determined by F and the sample space of x, 

y is a random variable, 

G is a Borel field, $G \subset \{x\} \vee F$, and y is essentially measurable on 
the field determined by G and the sample space of q. 


For convenience let CQAx (“conditionally quantized approximation 
to x”) denote the class of all objects of the kind described, based on 
the given probability space $\{\Omega, B, P\}$, the given x, and the given set Q. 

Given a Q-coded communication system $\{x_n, q_n, y_n\}$, given a delay d 
and a memory span m, let $X_{n,d}$ be the sample space of the selection 
$\{x_n, x_{n-1}, \ldots\}$ of random variables from which the specific variable 
$x_{n-d}$ has been deleted. Let $Q_{n,m}$ be the sample space of the random 
variables $\{q_{n-1}, q_{n-2}, \ldots, q_{n-m}\}$. Then it is easy to see that $\{x_n, q_n, y_n\}$ 
is a Q-coded communication system with decoder memory span m 
if and only if for each $n = 0, \pm 1, \pm 2, \ldots$ 

$$[x_{n-d} \mid q_n, X_{n,d} \mid y_n, Q_{n,m}] \in \mathrm{CQA}x_{n-d}.$$

Given $\psi$, a $[x \mid q, F \mid y, G] \in$ CQAx will be called weakly $\psi$-optimal if: 

(i) $E\{|\psi(x, y)|\} < \infty$, 

(ii) If random variables $q'$ and $y'$ are such that $[x \mid q', F \mid y', G] \in$ CQAx, 
then $E\{\psi(x, y)\} \le E\{\psi(x, y')\}$. 


The qualifier “weakly” in this definition signals the fact that the 
fields F and G are not allowed to vary in the competition for optimality. 


Lemma 1: If $\{x_n, q_n, y_n\}$ is a Q-coded communication system with 
decoder memory span m, and if $\{x_n, q_n, y_n\}$ is $\{\psi_n, d, m\}$-optimal, 
then for each n $[x_{n-d} \mid q_n, X_{n,d} \mid y_n, Q_{n,m}]$ is weakly $\psi_n$-optimal. 


Proof: Fix an n; for convenience identify it as n = 0. Suppose that 
we are given random variables $q'$ and $y'$, which we shall here call $q'_0$ 
and $y'_0$, such that 

$$[x_{-d} \mid q'_0, X_{0,d} \mid y'_0, Q_{0,m}] \in \mathrm{CQA}x_{-d}.$$

Define a new Q-coded communication system $\{x_n, q'_n, y'_n\}$ thus: 
For n < 0, $q'_n = q_n$, $y'_n = y_n$; 
For n = 0, $q'_0$ and $y'_0$ are those above; 
For n > 0, $q'_n = 1$ and $y'_n = 0$. 


That this is a Q-coded communication system with decoder memory 
span m follows at once from the definitions. Furthermore, the sample 
space of $\{q'_{-1}, q'_{-2}, \ldots, q'_{-m}\}$ is $Q_{0,m}$. Because $\{x_n, q_n, y_n\}$ is 
$\{\psi_n, d, m\}$-optimal, we conclude that $E\{|\psi_0(x_{-d}, y_0)|\} < \infty$ and that 
$E\{\psi_0(x_{-d}, y_0)\} \le E\{\psi_0(x_{-d}, y'_0)\}$. 
These, however, prove that $[x_{-d} \mid q_0, X_{0,d} \mid y_0, Q_{0,m}]$ is weakly 
$\psi_0$-optimal. Clearly this proof can be repeated for any other value of n. 
The proof of this lemma indicates, deliberately, the force of the 
notion of $\{\psi_n, d, m\}$-optimality for $\{x_n, q_n, y_n\}$. The competing 
communication system $\{x_n, q'_n, y'_n\}$ used in the proof sacrificed all 
reasonable behavior for n > 0, yet was still allowed to compete at n = 0. 
In particular, notice that even if $\{x_n, q_n, y_n\}$ is stationary, it must 
compete with nonstationary systems designed to excel at only one 
value of n. The theorems of Section II are not proved for stationary 
systems which are known only to minimize each $E\{\psi_n(x_{n-d}, y_n)\}$ 
against competing systems drawn from the class of stationary systems. 

Given a Borel field $G \subset B$, we define CCD(G) analogously to CCD: 
CCD(G) is the class of all random variables x such that: 

If z is a random variable measurable G, then $P\{x = z\} = 0$. 

The results of this paper all derive from Theorem 4. 


Theorem 4: Let $[x \mid q, F \mid y, G] \in$ CQAx and suppose that it is weakly 
$\psi$-optimal. If Q is a finite set, or if $\psi$ is Borel measurable and for each 
x is bounded from below, then there exists a Q-discrete random variable 
$q'$ and a random variable $y'$ such that 

(i) $[x \mid q', G \mid y', G] \in$ CQAx, 

(ii) $\psi(x, y') = \psi(x, y)$ with probability one. In particular, also, the 
object in i is weakly $\psi$-optimal. 

If $\psi \in K$ and $x \in$ CCD(G) then also 

(iii) $q' = q$ with probability one, and 

(iv) $y' = y$ with probability one. 


It then follows that the given q is essentially measurable on the Borel field 
$\{x\} \vee G$, determined by G and the sample space of x. 
We wish to use the given $[x \mid q, F \mid y, G]$ as a model for some 

$$[x_{n-d} \mid q_n, X_{n,d} \mid y_n, Q_{n,m}]$$

in a Q-coded communication system. Conclusions i and ii show that 
for any given n we can find a $q'_n$, essentially measurable $\{x_{n-d}\} \vee Q_{n,m}$, 
and a $y'_n$ such that, according to the criterion defined by $\psi_n$, $y'_n$ represents 
$x_{n-d}$ as well as $y_n$ did. Without conclusion iii, however, the substitution of 
$q'_n$ for $q_n$ can alter the subsequent Borel fields $Q'_{n+k,m}$, k > 0, to the point 
that we are no longer sure that $[x_{n+k-d} \mid q_{n+k}, X_{n+k,d} \mid y_{n+k}, Q'_{n+k,m}]$, k > 0, 
is weakly $\psi_{n+k}$-optimal. Without iii, therefore, one cannot apply 
Theorem 4 to prove the other theorems. 

It is convenient now to invoke a lemma which is a simple theorem 
from measure theory. The lemma provides a standard form for the 
variables q and y of an object $[x \mid q, F \mid y, G] \in$ CQAx. 


Lemma 2: Given a Q-discrete random variable q and a Borel field G, 
if y is a random variable measurable on the Borel field determined by G 
and the sample space of q, then there exist random variables $\{z_p, p \in Q\}$ 
such that 

(i) each $z_p$ is measurable G and 

(ii) for each $\omega \in \Omega$, if $q(\omega) = p$ then $y(\omega) = z_p(\omega)$. 

Conversely, of course, given $\{z_p, p \in Q\}$, each measurable G, any y defined 
by ii is measurable on the field determined by G and the sample space of q. 


The proof of this lemma consists in showing that the class of random 
variables of the type of y above, as the $\{z_p, p \in Q\}$ are selected 
arbitrarily from the class of variables measurable G, exhausts the class 
of all random variables measurable on the Borel field determined by G 
and the sample space of q. The proof is a straightforward exercise in 
measure theory and is omitted. 

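To begin the main argument, given $[x \mid q, F \mid y, G] \in$ CQAx and 
a Borel measurable function $\psi(x, y)$, if for each x $\psi(x, y)$ is bounded 
from below, or if Q is a finite set, we can define the random variable 

$$\xi(\omega) = \inf_{r \in Q} \psi(x(\omega), z_r(\omega)).$$

Then $\xi$ is measurable $\{x\} \vee G$. 
Given $p \in Q$ and $r \in Q$, we define sets $T^*_p$, $T_{pr}$, $T_p$ by 

$$T^*_p = \{\omega \mid \psi(x(\omega), z_p(\omega)) = \xi(\omega)\},$$

$$T_{pr} = \{\omega \mid \psi(x(\omega), z_p(\omega)) = \psi(x(\omega), z_r(\omega))\},$$

$$T_p = T^*_p - \bigcup_{r \in Q,\; r \ne p} T_{pr}.$$
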

Clearly each of these sets is measurable $\{x\} \vee G$. $T^*_p$ is the set where 
the index p minimizes $\psi(x, z_p)$, and $T_p$ is that subset of $T^*_p$ where this 
minimizing index is unique. It follows that if $r \ne p$ then 

$$T_p \cap T^*_r = \emptyset, \tag{10}$$

and as a consequence, $T_p \cap T_r = \emptyset$, $r \ne p$. 
Clearly 

$$T_p \subset T^*_p.$$

Also 

$$T^*_p \cap T_{pr} = T^*_p \cap T^*_r, \tag{11}$$

since either side is the set where an index minimizing $\psi(x, z_p)$ can be 
equal either to p or to r. 
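The set constructions above can be tried out on a finite sample space. The sketch below is only an illustration under stated assumptions: the sample space is a finite list of points, the table `costs[w][p]` stands for the value of the cost function at the sample point `w` for the candidate with index `p`, and relations (10) and (11) are read as "the set where p is the unique minimizer is disjoint from the set where r is a minimizer" and "p is a minimizer and ties with r exactly when both p and r are minimizers". The function name `minimizing_sets` and the example table are hypothetical.

```python
def minimizing_sets(costs):
    """Build the sets T*_p, T_pr and T_p from a finite cost table.

    costs[w][p] is the cost of using candidate p at sample point w.
    Returns three dicts of Python sets of sample-point indices.
    """
    points = range(len(costs))
    indices = range(len(costs[0]))
    xi = [min(row) for row in costs]  # least cost at each sample point
    t_star = {p: {w for w in points if costs[w][p] == xi[w]} for p in indices}
    t_pair = {(p, r): {w for w in points if costs[w][p] == costs[w][r]}
              for p in indices for r in indices if r != p}
    t_unique = {
        p: t_star[p] - set().union(*(t_pair[(p, r)] for r in indices if r != p))
        for p in indices
    }
    return t_star, t_pair, t_unique


# Four sample points, three candidate indices (hypothetical numbers).
costs = [[0, 1, 2],
         [1, 1, 3],
         [2, 0, 0],
         [5, 4, 4]]
t_star, t_pair, t_unique = minimizing_sets(costs)

for p in range(3):
    assert t_unique[p] <= t_star[p]                 # T_p is a subset of T*_p
    for r in range(3):
        if r != p:
            assert not (t_unique[p] & t_star[r])    # relation (10)
            assert t_star[p] & t_pair[(p, r)] == t_star[p] & t_star[r]  # relation (11)
```

In this table only the first sample point has a unique minimizing index, so only one of the sets of unique minimizers is nonempty; the other three points are exactly the kind of ties that the proof goes on to confine to a null set.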

In terms of these sets, the argument to be used can be outlined 
briefly. First, one shows that the $T^*_p$ essentially cover $\Omega$, in the sense 
that there is a null set N such that 

$$\Omega - N = \bigcup_{p \in Q} T^*_p. \tag{12}$$

This follows without argument, and with $N = \emptyset$, if Q is finite; it results 
from $\psi$-optimality in general. 
Second, by definition 

$$T^*_p - T_p \subset \bigcup_{r \in Q,\; r \ne p} T_{pr}. \tag{13}$$

Third, one observes that for $p, r \in Q$ and $p \ne r$, $T_{pr}$ consists of the 
set $S_{pr}$, 

$$S_{pr} = \{\omega \mid z_p(\omega) = z_r(\omega)\}$$

plus a disjoint remainder $T_{pr} - S_{pr}$. The hypothesis $x \in$ CCD(G) allows 
one to show that this remainder is a null set. Over the set $S_{pr}$, on the 
other hand, the information about x conveyed by the family $\{z_p, p \in Q\}$ 
is redundant. The hypothesis of $\psi$-optimality can then be violated, 
unless $S_{pr}$ is also a null set. It follows then that each $T_{pr}$ is a null set, 
and from (12) and (13) then that the $T_p$ partition $\Omega$ apart from a null set. 
From this the full theorem follows quickly. 
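To proceed with (12), given $p \in Q$, let $N_p$ be the set 

$$N_p = \{\omega \mid q(\omega) = p\} \cap \Bigl\{\Omega - \bigcup_{r \in Q} T^*_r\Bigr\}.$$

Fix an $\omega \in N_p$; then $y(\omega) = z_p(\omega)$ but $\omega \notin T^*_p$, so that 
$\xi(\omega) < \psi(x(\omega), z_p(\omega))$. 
It follows that there is some $r \in Q$, $r \ne p$, such that 

$$\psi(x(\omega), z_r(\omega)) < \psi(x(\omega), z_p(\omega)), \tag{14}$$

and indeed, since Q is bounded from below, that there is a least such r, 
call it $r^*(\omega)$. Notice that $N_p$ is measurable on the Borel field determined 
by the sample space of $\{x\}$, by F, and by G. Since $G \subset \{x\} \vee F$, it follows 
that $N_p$ is measurable $\{x\} \vee F$. That subset $R_{pk}$ of $N_p$ where $r^*(\omega) = k$ 
is empty if k = p; otherwise 

$$R_{pk} = N_p \cap \{\omega \mid \psi(x(\omega), z_k(\omega)) < \psi(x(\omega), z_p(\omega))\} \quad \text{if } k = 1 \ne p,$$

$$R_{pk} = N_p \cap \{\omega \mid \psi(x(\omega), z_k(\omega)) < \psi(x(\omega), z_p(\omega))\} \cap \bigcap_{j=1}^{k-1} \{\omega \mid \psi(x(\omega), z_j(\omega)) \ge \psi(x(\omega), z_p(\omega))\} \quad \text{if } k > 1,\; k \ne p.$$

It follows from these equalities that $R_{pk}$ and $r^*$ are measurable $\{x\} \vee F$. 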
We now define the Q-discrete random variable $q'$ by 

If $p \in Q$ and $\omega \in N_p$, $q'(\omega) = r^*(\omega)$; 
If $\omega \in \Omega - \bigcup_{p \in Q} N_p$, then $q'(\omega)$ is the least value of $r \in Q$ such that 
$\omega \in T^*_r$. 

Since the $N_p$ cover the complement of $\bigcup_p T^*_p$, and since Q is bounded 
from below, this defines $q'(\omega)$ for each $\omega \in \Omega$; clearly $q'$ is Q-discrete. 
Given $k \in Q$, the set where $q' = k$ consists of the union of 

$$\bigcup_{p \in Q} R_{pk}$$

with the set $V_k$, where 

$$V_1 = T^*_1,$$

$$V_k = (\Omega - T^*_1) \cap \cdots \cap (\Omega - T^*_{k-1}) \cap T^*_k, \qquad k > 1.$$

Since each $V_k$ is measurable $\{x\} \vee G \subset \{x\} \vee F$, it follows that $q'$ is 
measurable $\{x\} \vee F$. Furthermore, over $\Omega - \bigcup_{p \in Q} N_p$, $q'$ is equal to a 
random variable that is measurable $\{x\} \vee G$, since each $V_k$ is measurable 
on this latter field. 
We now define the random variable $y'$ by 

$$y'(\omega) = z_{q'(\omega)}(\omega), \qquad \omega \in \Omega.$$

Then $y'$ is measurable on G and the sample space of $q'$. It follows that 
$[x \mid q', F \mid y', G] \in$ CQAx, and from the hypothesis of weak $\psi$-optimality 
then that 

$$E\{\psi(x, y)\} \le E\{\psi(x, y')\}. \tag{15}$$

But now we claim that for all $\omega \in \Omega$ 

$$\psi(x(\omega), y'(\omega)) \le \psi(x(\omega), y(\omega)). \tag{16}$$



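First, if $\omega \in N_p$, we have 

$$\psi(x(\omega), y'(\omega)) = \psi(x(\omega), z_{r^*(\omega)}(\omega)) < \psi(x(\omega), z_p(\omega)) = \psi(x(\omega), y(\omega)), \tag{17}$$

the inequality being by definition of $r^*$. Therefore strict inequality 
prevails in (16) for $\omega \in \bigcup_{p \in Q} N_p$. Consider now an 
$\omega \in (\Omega - \bigcup_{p \in Q} N_p) \cap \{\omega' \mid q'(\omega') = p\}$. For this $\omega$ we have $\omega \in T^*_p$ and 
$\psi(x(\omega), y'(\omega)) = \psi(x(\omega), z_p(\omega)) \le \psi(x(\omega), z_r(\omega))$ for any $r \in Q$, by definition of $T^*_p$. But 
then (16) follows for this $\omega$ because $y(\omega) = z_r(\omega)$ for some $r \in Q$. 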

Now from (16), by taking expectations, we conclude the inequality 
opposite in sense to (15), hence (15) is an equality, and (16) is then 
an equality with probability one. Therefore ii of Theorem 4 is proved. 
Now by (17), (16) is a strict inequality over $N = \bigcup_{p \in Q} N_p$. Hence 
this latter set is a null set. Therefore i of Theorem 4 is proved, since 
$q'$ is equal, over the complement of N, to a variable that is measurable 
$\{x\} \vee G$, as we noted earlier. Finally, since 

$$\Omega - \bigcup_{r \in Q} T^*_r = \bigcup_{p \in Q} N_p = N,$$

the $T^*_p$ essentially cover $\Omega$. This is (12), as was to be proved. 
It would be possible at this point to invoke the hypotheses $\psi \in K$ 
and $x \in$ CCD(G) to conclude iv of the Theorem. It will be more 
efficient to prove iii and iv together. To do so requires, as our earlier 
outline suggests, that we examine the sets $T^*_p \cap T_{pr}$ over which 
redundancy prevails (because on $T^*_p \cap T_{pr}$, either of $z_p$ or $z_r$, where 
$r \ne p$, could be used to define the same value of y minimizing $\psi(x, y)$). 

We have concluded (12), that except for $\omega \in N$, a null set, for each 
$\omega$ there is at least one $p \in Q$ such that $\xi(\omega) = \psi(x(\omega), z_p(\omega))$, that is, 
the minimizing index is uniquely p for $\omega \in T_p - N$. 

Now define, as earlier, for $r \ne p$, 

$$S_{pr} = \{\omega \mid z_p(\omega) = z_r(\omega)\}.$$

Then if $\omega \in T_{pr} - S_{pr}$, we have 

$$\psi(x(\omega), z_p(\omega)) = \psi(x(\omega), z_r(\omega)), \qquad z_p(\omega) \ne z_r(\omega).$$

Since $\psi \in K$, it follows that for some $k = 1, 2, \ldots$ we have 

$$x(\omega) = f_k(z_p(\omega), z_r(\omega)). \tag{18}$$

Now let $A_{kpr}$ be the set of all $\omega$ such that (18) holds. We have just 
showed that 

$$T_{pr} - S_{pr} \subset \bigcup_{k=1}^{\infty} A_{kpr}. \tag{19}$$

But now, since $f_k$ is Borel measurable and each $z_p$ is measurable G, 
(18) constrains x on $A_{kpr}$ to be equal to a random variable measurable G. 
Since $x \in$ CCD(G), then $A_{kpr}$ is a subset of some null set, 

$$P\{A_{kpr}\} = 0, \qquad k = 1, 2, \ldots,$$

and 

$$P\Bigl\{\bigcup_{k=1}^{\infty} A_{kpr}\Bigr\} = 0.$$

This last with (19) makes $P\{T_{pr} - S_{pr}\} = 0$. Indeed, finally, since Q 
is countable, 

$$P\Bigl\{\bigcup_{p \in Q} \bigcup_{r \ne p} (T_{pr} - S_{pr})\Bigr\} = 0.$$

It is important later that by definition, $S_{pr}$ is measurable G and 
therefore that, by (19), $T_{pr}$ is essentially measurable G. 
We now define a new Q-discrete random variable $q''$ and a 
corresponding $y''$. The construction depends upon an arbitrarily chosen 
$p_0 \in Q$ and an arbitrarily chosen real number a, although the notation 
will not emphasize this dependence. Later it will be shown that $q'' = q'$ 
and $y'' = y'$ each with probability one, so that the dependence upon 
$p_0$ and a is not essential. 

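Fix a $p_0 \in Q$ and select a real number a. Define the random variable 
$z''_{p_0}(\omega)$ by: 

if $\omega \in \bigcup_{r \in Q,\; r \ne p_0} T_{p_0 r}$, then $z''_{p_0}(\omega) = a$; 
otherwise, $z''_{p_0}(\omega) = z_{p_0}(\omega)$. 

Then $z''_{p_0}(\omega)$ is measurable G. Define 

$$z''_p = z_p, \qquad p \in Q,\; p \ne p_0.$$

Then certainly each $z''_p$, $p \in Q$, is measurable G. Define the Q-discrete 
random variable $q''(\omega)$ by 

If $\omega \in T_{p_0} \cup [(T^*_{p_0} - T_{p_0}) \cap \{\omega' \mid \psi(x(\omega'), a) < \psi(x(\omega'), z_{p_0}(\omega'))\}]$ 
then $q''(\omega) = p_0$; 
if $\omega \in (T^*_{p_0} - T_{p_0}) \cap \{\omega' \mid \psi(x(\omega'), a) \ge \psi(x(\omega'), z_{p_0}(\omega'))\}$ 
then $q''(\omega)$ is the least value of $p \in Q$ such that $p \ne p_0$ and $\omega \in T^*_p$; 
if $\omega \in \Omega - T^*_{p_0}$, then $q''(\omega) = q'(\omega)$. 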


It is easily seen that this defines $q''$ for all $\omega \in \Omega$. 
We now define the random variable $y''$ by $y''(\omega) = z''_{q''(\omega)}(\omega)$. Then 
$y''$ is measurable on G and the sample space of $q''$, so that by 
construction $[x \mid q'', F \mid y'', G] \in$ CQAx. Applying the hypothesis of weak 
$\psi$-optimality, we conclude that 

$$\int [\psi(x, y'') - \psi(x, y)]\, dP = E\{\psi(x, y'')\} - E\{\psi(x, y)\} \ge 0. \tag{20}$$



We now partition the domain $\Omega$ of integration into the four sets 

$$A_1 = T_{p_0},$$

$$A_2 = (T^*_{p_0} - T_{p_0}) \cap \{\omega \mid \psi(x(\omega), a) < \psi(x(\omega), z_{p_0}(\omega))\},$$

$$A_3 = (T^*_{p_0} - T_{p_0}) \cap \{\omega \mid \psi(x(\omega), a) \ge \psi(x(\omega), z_{p_0}(\omega))\},$$

$$A_4 = \Omega - T^*_{p_0}.$$

That this is a partition follows from the definition and the fact, already 
proved, that $T_{p_0} \subset T^*_{p_0}$. We consider the four resulting integrals 
individually, in the order of the listing. 
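If $\omega \in T_{p_0}$ then either $\omega \in T_{p_0} \cap N$, or $\omega \in T_{p_0} - N$. We may ignore 
the first case. For the second, by definition of $T_{p_0}$, if $r \ne p_0$ 

$$\psi(x(\omega), z_{p_0}(\omega)) < \psi(x(\omega), z_r(\omega)). \tag{21}$$

Also, by definition 

$$\omega \notin \bigcup_{r \in Q,\; r \ne p_0} T_{p_0 r},$$

and therefore by definition $z''_{p_0}(\omega) = z_{p_0}(\omega)$, and $q''(\omega) = p_0$. Then 

$$\psi(x(\omega), y''(\omega)) = \psi(x(\omega), z''_{p_0}(\omega)) = \psi(x(\omega), z_{p_0}(\omega))$$

and from the inequality (21) we conclude that the integrand 

$$\psi(x(\omega), y''(\omega)) - \psi(x(\omega), y(\omega)) \le 0,$$

since $y(\omega)$ is equal to some $z_r(\omega)$, $r \in Q$. Hence the integral over $A_1$ is 
not positive. 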
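If $\omega \in A_2$, then by definition $q''(\omega) = p_0$ and 

$$y''(\omega) = z''_{p_0}(\omega).$$

Again we ignore the contribution of $A_2 \cap N$. If $\omega \in A_2 - N$ then by 
(13), 

$$\omega \in \bigcup_{r \in Q,\; r \ne p_0} T_{p_0 r}.$$

Then by definition $z''_{p_0}(\omega) = a$. Hence, the integrand 

$$\psi(x(\omega), y''(\omega)) - \psi(x(\omega), y(\omega)) = [\psi(x(\omega), a) - \psi(x(\omega), z_{p_0}(\omega))] + [\psi(x(\omega), z_{p_0}(\omega)) - \psi(x(\omega), y(\omega))].$$

The first bracket on the right is < 0 by definition of $A_2$, and the second 
is $\le 0$ because $\omega \in T^*_{p_0}$ and by definition of $T^*_{p_0}$ we have 
$\psi(x(\omega), z_{p_0}(\omega)) \le \psi(x(\omega), z_r(\omega))$ for all $r \in Q$; among the latter is $\psi(x(\omega), y(\omega))$. Hence 
the second integral is not positive, and its integrand is strictly negative. 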
Now consider $\omega \in A_3$. We ignore the integral over $A_3 \cap N$. If 
$\omega \in A_3 - N$, then $q''(\omega) = p \ne p_0$ and $\omega \in T^*_p$ for some $p \in Q$. For this 
$\omega$ we have 

$$\psi(x(\omega), y''(\omega)) = \psi(x(\omega), z''_p(\omega)) = \psi(x(\omega), z_p(\omega)) \le \psi(x(\omega), z_r(\omega))$$

for all $r \in Q$; here the first equality is by definition of $y''$, the second 
by definition of $z''_p$ since $p \ne p_0$, and the inequality is by definition 
of $T^*_p$. But the inequality makes the integrand in (20) $\le 0$, since 
$y(\omega) = z_r(\omega)$ for some $r \in Q$. Therefore, the integral over $A_3$ is not 
positive. 

Over $A_4$, the integrand of (20) is 

$$[\psi(x, y'') - \psi(x, y')] + [\psi(x, y') - \psi(x, y)].$$

The second bracket vanishes with probability one by ii of Theorem 4, 
already proved. The first bracket is 

$$\psi(x(\omega), z''_{q''(\omega)}(\omega)) - \psi(x(\omega), z_{q'(\omega)}(\omega))$$

and this vanishes for all $\omega \in A_4$ by the definitions because over $A_4$, 
$\xi(\omega) < \psi(x(\omega), z_{p_0}(\omega))$ so that $q'(\omega) \ne p_0$; therefore by definition 
$z''_{q''(\omega)}(\omega) = z_{q'(\omega)}(\omega)$. 

We conclude from these calculations that the integral (20) cannot 
be positive. By (20), therefore, the integral vanishes. But the argument 
showed that the integrand was $\le 0$ with probability one, hence indeed, 
the integrand vanishes with probability one: 

$$\psi(x, y'') = \psi(x, y) \quad \text{with probability one.}$$

In particular, over $A_2$, the integrand was strictly < 0. Therefore 
$A_2$ has probability zero. We shall now exploit this fact. 

In the argument above, a was any real number. Let $\{a_n\}$ be a countable 
dense set of real numbers and let 

$$W_n = \{\omega \mid \psi(x(\omega), a_n) < \psi(x(\omega), z_{p_0}(\omega))\}.$$

We have just proved that $P\{A_2\} = 0$, which is to say that we could 
have proved, for each n, that 

$$P\{(T^*_{p_0} - T_{p_0}) \cap W_n\} = 0.$$

Then also 

$$N_2 = \bigcup_n (T^*_{p_0} - T_{p_0}) \cap W_n$$

is a null set. Now if $\omega \in N_2$, then $\omega \in T^*_{p_0} - T_{p_0}$ and also there is some 
number $a_n$ such that 

$$\psi(x(\omega), a_n) < \psi(x(\omega), z_{p_0}(\omega)). \tag{22}$$

Conversely, if $\omega \in T^*_{p_0} - T_{p_0}$ and there is a number $a_n$ such that (22) 
is true, then $\omega \in N_2$. Therefore if $\omega \in (T^*_{p_0} - T_{p_0}) - N_2$, then for every 
number $a_n$ we have 

$$\psi(x(\omega), a_n) \ge \psi(x(\omega), z_{p_0}(\omega)). \tag{23}$$

Given an $\omega \in (T^*_{p_0} - T_{p_0}) - N_2$, choose a sequence $a_n \to x(\omega)$. Assume 
that $\psi \in K$. Then $\psi$ is continuous and from (23) we have 

$$0 = \psi(x(\omega), x(\omega)) = \lim \psi(x(\omega), a_n) \ge \psi(x(\omega), z_{p_0}(\omega)) \ge 0.$$

Notice, incidentally, that it suffices here for each x that $\psi(x, y)$ be 
continuous for y in some neighborhood of x. This is an example of one 
way in which K can be enlarged. 

From this and item iv in the definition of K, there is some integer k 
such that 

$$x(\omega) = g_k(z_{p_0}(\omega)). \tag{24}$$

Let $C_k$ be the set of all $\omega$ such that (24) holds. Since $g_k$ is Borel 
measurable, over $C_k$, (24) constrains x to be equal to a function measurable 
G. If $x \in$ CCD(G), then $C_k$ is a null set. But we have just showed above 
that 

$$(T^*_{p_0} - T_{p_0}) - N_2 \subset \bigcup_{k=1}^{\infty} C_k.$$

Therefore 

$$P\{T^*_{p_0} - T_{p_0}\} = 0.$$

Since $p_0$ was arbitrary, this can be proved for each $p_0 \in Q$; therefore 
from (12) the $T_p$, $p \in Q$, essentially cover $\Omega$. We proved along with 
definitions that the $T_p$ are pairwise disjoint, hence they partition 
$\Omega - N_3$, where $N_3$ is some null set. 

We continue the argument using the selected $p_0$. For $\omega \in \Omega - N_3$, 
either $\omega \in T_{p_0}$ or $\omega \in T_r$ where $r \in Q$ but $r \ne p_0$. In this latter case, however, 
as we proved with the definitions, $\omega \in \Omega - T^*_{p_0}$; then by definition 
$q''(\omega) = q'(\omega)$. If $\omega \in T_{p_0}$, by the definitions $q''(\omega) = q'(\omega) = p_0$. 
Therefore 

$$q'' = q' \quad \text{with probability one.} \tag{25}$$

Furthermore we know that if $\omega \in T_p$, then $q'(\omega) = p$. From (25) 

$$y''(\omega) = z''_{q'(\omega)}(\omega). \tag{26}$$

If $\omega \in \Omega - T_{p_0}$, except at most on a null set we have $q''(\omega) \ne p_0$ and 
from (26) and the definition of $z''_p$ 

$$y''(\omega) = z''_{q'(\omega)}(\omega) = z_{q'(\omega)}(\omega) = y'(\omega), \qquad \omega \in (\Omega - T_{p_0}) - N_4, \tag{27}$$

where $N_4$ is a null set. Now if $\omega \in T_{p_0} - N$ we showed earlier that 
$z''_{p_0}(\omega) = z_{p_0}(\omega)$. Hence the equalities in (27) hold for $\omega \in T_{p_0} - N$ as 
well, so that 

$$y'' = y' \quad \text{with probability one.} \tag{28}$$


Equalities (25) and (28) free the constructions from any dependence, 
except on a null set, upon the initially selected $p_0$ and a. We need 
the Theorem to make identification with q and y. 

Let $S_p$ be that subset of $T_p$ where $q(\omega) \ne p$. Then if $\omega \in S_p$, by 
definition of $T_p$, 

$$\psi(x(\omega), y'(\omega)) = \psi(x(\omega), z_p(\omega)) < \psi(x(\omega), z_{q(\omega)}(\omega)) = \psi(x(\omega), y(\omega)).$$

From ii of Theorem 4, then, $P\{S_p\} = 0$, and $P\{\bigcup_{p \in Q} S_p\} = 0$. Since 
the $T_p$, $p \in Q$, essentially partition $\Omega$, it follows that $q' = q$ with 
probability one, and at once that $y(\omega) = z_{q(\omega)}(\omega) = z_{q'(\omega)}(\omega) = y'(\omega)$ with 
probability one. These conclusions are iii and iv of the Theorem, the 
proof of which is now complete. 

To prove Theorem 1, let $\{x_n, q_n, y_n\}$ be a given Q-coded 
communication system that is $\{\psi_n, 0\}$-optimal. Given n, by Lemma 1, 

$$[x_n \mid q_n, X_{n,0} \mid y_n, Q_{n,\infty}] \in \mathrm{CQA}x_n$$

and is weakly $\psi_n$-optimal. If $\psi_n \in K$ and $x_n \in$ CCD($Q_{n,\infty}$), Theorem 4 
proves that $q_n$ is measurable on $\{x_n\} \vee Q_{n,\infty}$. But $Q_{n,\infty}$ is the sample 
space of $\{q_{n-1}, q_{n-2}, \ldots\}$, and is therefore contained in the sample 
space of $\{x_{n-1}, x_{n-2}, \ldots\}$, since by hypothesis $\{x_n, q_n, y_n\}$ is a Q-coded 
communication system. The hypothesis $\{x_n\} \in$ CCD of Theorem 1 then 
implies that for the given n, $x_n \in$ CCD($Q_{n,\infty}$), and Theorem 4 establishes 
Theorem 1. 

Turning to Theorem 3, let $\{x_n, q_n, y_n\}$ be a given Q-coded 
communication system with decoder memory span m, and suppose that 
it is $\{\psi_n, d, m\}$-optimal. By Lemma 1, then, given n, 
$[x_{n-d} \mid q_n, X_{n,d} \mid y_n, Q_{n,m}] \in \mathrm{CQA}x_{n-d}$ and is weakly $\psi_n$-optimal. By the hypotheses of 
Theorem 3, $\psi_n \in K$, and $\{x_n\} \in$ CCDf. Consider $Q_{n,m}$, the sample space 
of $\{q_{n-1}, q_{n-2}, \ldots, q_{n-m}\}$. Suppose first that m > d; then this sample 
space is the smallest Borel field which contains both the sample space 
of $\{q_{n-1}, \ldots, q_{n-d}\}$ and that of $\{q_{n-d-1}, \ldots, q_{n-m}\}$. Since $M < \infty$, 
the first of these is a finite field, and the second is a subfield of 
$\{x_{n-d-1}, x_{n-d-2}, \ldots\}$ (since $\{x_n, q_n, y_n\}$ is indeed a Q-coded communication 
system). The hypothesis $\{x_n\} \in$ CCDf then implies that $x_{n-d} \in$ CCD($Q_{n,m}$). 
If $m \le d$, the subfield of $\{x_{n-d-1}, \ldots\}$ is empty, but the reasoning 
and conclusion are still valid. Then Theorem 4 applies and we conclude 
that $q_n$ is measurable on the sample space of $\{x_{n-d}, q_{n-1}, \ldots, q_{n-m}\}$. 
This is the first conclusion of Theorem 3. We note now that a weaker 
hypothesis than $\{x_n\} \in$ CCDf could suffice here. Indeed, if $m < \infty$, 
it is sufficient that: if A is a finite field then $x_n \in$ CCD(A). This is the 
final conclusion of Theorem 3. 

Given that $q_n$ is essentially measurable on $\{x_{n-d}, q_{n-1}, \ldots, q_{n-m}\}$, 
for each n, we conclude by induction that $q_n$ is essentially measurable 
$\{x_{n-d}, x_{n-d-1}, q_{n-2}, \ldots, q_{n-m-1}\}$, ... and finally then that $q_n$ is 
essentially measurable $\{x_{n-d}, x_{n-d-1}, \ldots\}$. Define 

$$q'_n = q_{n+d}, \qquad y'_n = y_{n+d}, \qquad n = 0, \pm 1, \ldots.$$

Then it is a simple translation of notation to verify that $\{x_n, q'_n, y'_n\}$ 
is a Q-coded communication system with decoder memory span m 
that is $\{\psi'_n, 0, m\}$-optimal, where $\psi'_n = \psi_{n+d}$, $n = 0, \pm 1, \ldots$. This 
is the second conclusion of Theorem 3. 

Finally, if d = 0, then “$\{x_n\} \in$ CCDf” may be replaced by: 
“$\{x_n\} \in$ CCD.” Then M is unrestricted, since no “future” is involved that 
must be restricted to a finite field. This completes the proof. 

Theorem 2 is a limiting case of Theorem 3, proved by putting $m = \infty$ 
everywhere in the proof of Theorem 3. 


V. A COROLLARY 


It is a consequence of Lemma 2 and of the proof of Theorem 4 that, 
given $\omega$, in a set of probability one, $q(\omega)$ is that unique value of p which 
minimizes $\psi(x(\omega), z_p(\omega))$. (This was remarked in connection with 
equation 25.) Applying this to the situation of Theorem 1, one sees 
that the transmitter of a delay-free Q-coded communication system 
$\{x_n, q_n, y_n\}$ satisfying Theorem 1 has the block diagram form shown 
in Fig. 1. (If d > 0, one simply puts an analog delay line in the input 
lead, ahead of the rest of the system.) 

This block diagram can be described thus: at time subsequent to 
t = n − 1 and prior to t = n, the transmitter has in its digital store 
the values $q_{n-1}, q_{n-2}, \ldots$ of the previously transmitted signals. From 
these, quantities $z_{1,n}, z_{2,n}, z_{3,n}, \ldots$ are constructed. These are the 
$z_p$ of Lemma 2, for the particular random variable $y_n$. When $x_n$ becomes 
available, quantities $\psi_n(x_n, z_{1,n}), \psi_n(x_n, z_{2,n}), \ldots$ are constructed and 
the comparator identifies the least of these (unique with probability 
one). The transmitted $q_n$ is that value of the index which identifies 
the least $\psi_n(x_n, z_{p,n})$. This index is transmitted to the receiver as $q_n$ 
and is also stored in the transmitter’s memory for the next cycle. 
The receiver can be realized using a portion of the transmitter, as 
suggested in Fig. 2. Each function generator in these diagrams can 
of course be nonstationary. Connections to a master “clock” are not 
shown. 

Fig. 1. Generalized form of optimum transmitter. 
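The transmitter and receiver cycle just described can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the circuit of Fig. 1: the function generator is a hypothetical `delta_levels` that turns the stored digits into two candidate values one step below and one step above a running reconstruction (a delta-modulation-like choice made only for the example), the cost is squared error, and ties in the comparator are broken by taking the least index.

```python
def psi_sq(x, y):
    """Squared-error cost, one of the conventional members of the class K."""
    return (x - y) ** 2


def delta_levels(history, step=1):
    """Hypothetical function generator: candidates z_{p,n} built from past digits.

    Each past digit 1 moves the running value up by `step`, each 0 moves it down.
    """
    base = sum(step if q == 1 else -step for q in history)
    return [base - step, base + step]


def run_system(xs, levels, psi):
    """Run transmitter and receiver side by side on the message samples xs."""
    tx_store, rx_store = [], []   # digital stores at transmitter and receiver
    qs, ys = [], []
    for x in xs:
        candidates = levels(tx_store)               # z_{1,n}, z_{2,n}, ...
        costs = [psi(x, z) for z in candidates]     # psi_n(x_n, z_{p,n})
        q = costs.index(min(costs))                 # comparator: least cost, least index on ties
        y = levels(rx_store)[q]                     # receiver rebuilds z_{q,n} from its own store
        tx_store.append(q)
        rx_store.append(q)
        qs.append(q)
        ys.append(y)
    return qs, ys


qs, ys = run_system([0.4, 1.2, 2.9, 2.0], delta_levels, psi_sq)
print(qs, ys)
```

Because the receiver applies the same function generator to the same stored digits, it reproduces the candidate selected by the transmitter, which is the sense in which the receiver can be realized from a portion of the transmitter.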


VI. REMARKS ON K AND CCD 


One might ask to what degree are the central hypotheses of Theorem 4 
necessary to the conclusions. The theorem itself provides a partial 
answer: conclusions i and ii do not use $x \in$ CCD(G) at all, and use 
only a measurability and a boundedness property of $\psi$. The critical 
conclusions are the uniqueness conclusions iii and iv. Clearly, something 
is required of $\psi(x, y)$ that makes it, in some sense, smaller when y = x 
than elsewhere, and not too indifferent to the value of y when $y \ne x$, 
if uniqueness is to be expected from the hypothesis of $\psi$-optimality. 
As we have already noted, the hypothesis $\psi \in K$ is fairly weak in this 
regard, and could, in the presence of CCD, be made weaker at the 
expense of further elaboration of the proof. 

Fig. 2. Form of receiver. 

The interesting hypothesis is $x \in \mathrm{CCD}(G)$. This implies that if $x$ has
a conditional probability distribution relative to the field $G$, then that
distribution is continuous. It is easy to see that the $\psi$-optimum quantizing
of a random variable $x$ need not be unique if the distribution of $x$
is not continuous, even when one uses $\psi(x, y) = (x - y)^2$. Since $y$
in Theorem 2 $\psi$-optimally quantizes $x$ for each event measurable on
the conditioning field $G$, something like $x \in \mathrm{CCD}(G)$ is necessary if
conclusion iv is to follow. Thus we conclude a loose kind of necessity
for this hypothesis.
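
The non-uniqueness noted above can be seen in a small discrete case. The sketch below is an illustration only: the three-point distribution, the restriction to two output levels, and the name `two_level_quantizers` are assumptions made here and do not come from the theorems. It enumerates every two-cell quantizer under $\psi(x, y) = (x - y)^2$, with each output level placed at the centroid of its cell, and finds two distinct minimizers.

```python
from fractions import Fraction
from itertools import product


def two_level_quantizers(points, probs):
    """Return (distortion, cells) for every split of `points` into two nonempty cells."""
    results = []
    for labels in product((0, 1), repeat=len(points)):
        if len(set(labels)) < 2 or labels[0] != 0:
            continue  # skip empty cells and mirror-image labelings
        distortion = Fraction(0)
        cells = []
        for c in (0, 1):
            idx = [i for i, lab in enumerate(labels) if lab == c]
            mass = sum(probs[i] for i in idx)
            # For squared error the best output level of a cell is its centroid.
            centroid = sum(probs[i] * points[i] for i in idx) / mass
            distortion += sum(probs[i] * (points[i] - centroid) ** 2 for i in idx)
            cells.append(tuple(points[i] for i in idx))
        results.append((distortion, tuple(cells)))
    return results


# Hypothetical discontinuous distribution: x is 0, 1, or 2, each with probability 1/3.
points = [Fraction(0), Fraction(1), Fraction(2)]
probs = [Fraction(1, 3)] * 3
results = two_level_quantizers(points, probs)
best = min(d for d, _ in results)
optimal = [cells for d, cells in results if d == best]
```

Here the splits {0}, {1, 2} and {0, 1}, {2} attain the same minimum distortion, so the optimum quantizer of this discrete variable is not unique.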

We notice finally that iii and iv were proved by confining the redundancy
among the $\{z_p,\ p \in Q\}$ to a null set. In the application of
this idea to the situation of Theorem 1, it seems likely that redundancy
in the $\{z_{p,n},\ p \in Q\}$ for some fixed $n$ might indeed be exploited to improve
some

$$E\{W_{n+k}(x_{n+k}, y_{n+k})\}, \qquad k > 0, \tag{29}$$

by selection, among the minimizing $z_{p,n}$ to which $E\{\psi_n(x_n, y_n)\}$ is
indifferent, of one which actually contributes information about $x_{n+k}$ and
therefore allows a reduction in (29). I have no example to show this
phenomenon, so its existence remains a conjecture. We have proved,
of course, that its possible existence is ruled out by $x \in \mathrm{CCD}(G)$.


REFERENCES 


1. Shannon, C. E., ‘‘A Mathematical Theory of Communication,’’ B.S.T.J., 27, 
Nos. 3 and 4 (July and October 1948), pp. 379-423, 623-656. 


The Equivalence of Certain Harper Codes 


By MORGAN M. BUCHNER, Jr. 
(Manuscript received May 12, 1969) 


A class of binary encoding algorithms called Harper codes has been 
studied previously as a means of encoding numbers for transmission over 
an idealized binary channel. This paper considers a more general and 
practical transmission system model. For any Harper code, it presents 
a technique for obtaining the expression for the average absolute numerical 
error that occurs during transmission. It shows that not all Harper codes
exhibit the same average absolute numerical error for all transmission
systems that satisfy the model. However, there is a subset of Harper codes
such that all codes in the subset give identical performance. The paper
defines the subset and presents an expression for the average absolute
numerical error for any Harper code in the subset. The subset is important
because it includes the natural binary representation, the Gray code, and
the folded binary code.


I. INTRODUCTION 


In order to send numerical data over a binary channel, each input 
number must be encoded into a suitable binary sequence for transmis- 
sion. For example, when a sampler and quantizer are used, a binary 
sequence is assigned to each quantization level. For each sample, the 
number of the appropriate quantization level is transmitted by sending 
the binary sequence assigned to the level. But how should the binary 
sequences be assigned? One approach is to use the natural binary 
representation of each number. Alternatively, a Gray code might be 
used with the idea that its unit-distance properties are in some sense 
desirable. 

If the transmission system is error-free and if the binary sequences 
are unique, it does not matter how the sequences are assigned. How- 
ever, if transmission errors can occur, some assignment algorithms may 
be preferable to others. In this paper, the performance of certain 
binary encoding algorithms is considered. The average magnitude by 
which the number delivered to the destination differs from the transmitted
number is used as the criterion of performance.

Previously, Harper presented a class of binary codes that we call
Harper codes.$^1$ The class includes the natural binary representation,
the Gray code, and the folded binary code. Reference 2 showed that
for any set of $2^k$ input numbers all Harper codes exhibit the same mean
magnitude error when used with a specific binary transmission system
model (see Section II) and that, when the probability of transmission
error is sufficiently small, Harper codes are optimum.

In this paper, a more general transmission system model is considered.
For $2^k$ equally spaced input numbers, a means of obtaining the expression
for the average absolute numerical error (hereafter called average
numerical error) for any Harper code is presented. Not all Harper codes
exhibit the same average numerical error; they do so only in the special
case when the transmission system model reduces to the model used
in Ref. 2. However, there does exist a subset of Harper codes such
that all codes in the subset are equivalent in performance. The subset 
is defined and an expression is given for the average numerical error 
for any Harper code in the subset. The subset is important because 
it includes the natural binary representation, the Gray code, and the 
folded binary code. 


II. SYSTEM MODEL AND PREVIOUS RESULTS 


A system model is shown in Fig. 1. In general, we wish to send over
a binary transmission system† any one of the $2^k$ equally likely numbers
of the form $A + Bs$ where $s$ is an integer, $0 \le s \le 2^k - 1$. At the
transmitter, the binary encoder receives $A + Bs$ and, based upon $s$,
sends a $k$-bit binary sequence assigned by a Harper code and denoted
by $H_k(s)$. At the receiver, a binary decoder receives a $k$-bit binary
sequence $H_k(r)$, $0 \le r \le 2^k - 1$, and generates $A + Br$. Let
$\Pr[H_k(r) \mid H_k(s)]$ denote the probability of receiving $H_k(r)$ when $H_k(s)$ is sent.
If all $s$ are equally likely, the average numerical error (as in Ref. 3)
that occurs is

$$\mathrm{ANE} = \frac{B}{2^k} \sum_{r=0}^{2^k-1} \sum_{s=0}^{2^k-1} |r - s| \Pr[H_k(r) \mid H_k(s)]. \tag{1}$$
The average numerical error is dependent upon the binary encoding
algorithm and the transmission system through $\Pr[H_k(r) \mid H_k(s)]$.

† It is important to distinguish between the binary transmission system and the
channel. The transmission system includes the channel and the encoder and decoder
for error control.

Harper codes are defined in terms of the vertices of the $k$-cube’.
Assign 0 to an arbitrary vertex; that is, $H_k(0)$ is arbitrary. Having
assigned $0, 1, 2, \ldots, l - 1$, assign $l$ to an unnumbered vertex (not
necessarily unique) that has the most numbered one-distant neighbors.‡
In the remainder of this paper, certain properties of Harper codes
presented in Refs. 1 and 2 are used without specific reference.
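
The numbering rule can be checked mechanically. The following is a minimal sketch, assuming each $H_k(s)$ is stored as a $k$-bit integer; the name `is_harper_code` and the list names are introduced here for illustration, and `TABLE_I` holds the $k = 4$ code listed in Table I.

```python
def is_harper_code(code, k):
    """Return True if code[s] (k-bit integers, s = 0, 1, ...) obeys the numbering rule."""
    if sorted(code) != list(range(2 ** k)):
        return False  # every vertex of the k-cube must be used exactly once
    numbered = set()

    def numbered_neighbors(v):
        # Count the already numbered vertices at distance one from v.
        return sum((v ^ (1 << b)) in numbered for b in range(k))

    for vertex in code:
        best = max(numbered_neighbors(v) for v in range(2 ** k) if v not in numbered)
        if numbered_neighbors(vertex) != best:
            return False  # another unnumbered vertex had more numbered neighbors
        numbered.add(vertex)
    return True


TABLE_I = [0b0000, 0b0010, 0b0001, 0b0011, 0b0111, 0b0110, 0b0101, 0b0100,
           0b1000, 0b1001, 0b1011, 0b1010, 0b1100, 0b1110, 0b1101, 0b1111]
natural = list(range(16))
gray = [s ^ (s >> 1) for s in range(16)]
not_harper = [0b00, 0b11, 0b01, 0b10]  # second vertex has no numbered neighbor
```

The first vertex always passes, since no vertex has a numbered neighbor at that stage; this matches the statement that $H_k(0)$ is arbitrary.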

We can now summarize the results in Ref. 2. In a binary transmission
system as shown in Fig. 1, it was assumed that the errors between
locations 1 and 2 are independent of the transmitted bits and occur
independently of one another with a fixed probability. For such a
transmission system and for any set of $2^k$ input numbers, it was shown that
all Harper codes yield the same mean magnitude error and, thus,
are equivalent. Also, it was shown that when that probability is sufficiently
small, Harper codes are optimum for any set of $2^k$ input numbers because
they minimize the mean magnitude error. Of course, the results in
Ref. 2 are applicable to our set of $2^k$ equally spaced numbers and
indicate that all Harper codes yield the same average numerical error
for a transmission system that satisfies the model in Ref. 2.

Fig. 1. System model. (Blocks shown: numerical source, binary encoder
receiving $A + Bs$, binary transmission system, binary decoder, destination.)
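
The equivalence under the model of Ref. 2 can be checked numerically for small $k$. The sketch below evaluates equation (1) directly, assuming each of the $k$ bits is inverted independently with a common probability, called `p` here (a name chosen for this illustration). Exact fractions are used so that equality can be tested without rounding.

```python
from fractions import Fraction


def ane_independent_errors(code, k, p, B=1):
    """Equation (1) when each bit is inverted independently with probability p."""
    n = 2 ** k
    total = Fraction(0)
    for s in range(n):
        for r in range(n):
            w = bin(code[r] ^ code[s]).count("1")  # weight of the error pattern
            total += abs(r - s) * p ** w * (1 - p) ** (k - w)
    return B * total / n


TABLE_I = [0b0000, 0b0010, 0b0001, 0b0011, 0b0111, 0b0110, 0b0101, 0b0100,
           0b1000, 0b1001, 0b1011, 0b1010, 0b1100, 0b1110, 0b1101, 0b1111]
natural = list(range(16))
gray = [s ^ (s >> 1) for s in range(16)]

p = Fraction(1, 10)
values = [ane_independent_errors(c, 4, p) for c in (natural, gray, TABLE_I)]
```

For these three Harper codes the three entries of `values` coincide, as the result quoted from Ref. 2 requires.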

However, the transmission system model in Ref. 2 is extremely
restrictive. Channels with correlated errors are excluded. The model
also excludes transmission systems using many types of error-correcting
codes even if the actual channel is a memoryless binary symmetric
channel with probability of bit error $p$. In fact, even the Hamming
perfect single error-correcting codes when used with a memoryless
binary symmetric channel do not comply with the model in Ref. 2.
The reason is that, in a Hamming code, the $H_k(s)$ of a particular weight
are not all encoded as code vectors of equal weight. Thus, error
patterns of equal weight in the Harper code sequences do not all occur
with equal probability. However, in order for a transmission system
to satisfy the model in Ref. 2, all error patterns of equal weight must
occur with equal probability. It follows that the Hamming code violates
the model in Ref. 2.

‡ The weight of an $n$-tuple $v$ is the number of nonzero components in $v$ and is
denoted by $w[v]$. The distance between two binary $n$-tuples $u$ and $v$ is $w[u \oplus v]$ where
$\oplus$ denotes component by component modulo 2 addition of $n$-tuples.
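
The Hamming-code remark can be illustrated for the (7,4) code. The sketch below is written under stated assumptions: the generator rows `G_ROWS` are one particular systematic choice made here, the 4-bit Harper sequence is used directly as the message, and decoding is to the nearest code vector (unique for a perfect code). Because the code is linear, it suffices to send the all-zero code vector and tabulate which message error pattern each channel error pattern produces.

```python
from fractions import Fraction

# Assumed systematic generator rows [I | P] of one (7,4) Hamming code.
G_ROWS = [0b1000110, 0b0100101, 0b0010011, 0b0001111]


def encode(msg):
    """Map a 4-bit message (most significant bit first) to a 7-bit code vector."""
    cw = 0
    for i, row in enumerate(G_ROWS):
        if (msg >> (3 - i)) & 1:
            cw ^= row
    return cw


def weight(v):
    return bin(v).count("1")


def message_error_probabilities(p):
    """Probability of each 4-bit message error pattern after nearest-codeword decoding."""
    codebook = {encode(m): m for m in range(16)}
    probs = {m: Fraction(0) for m in range(16)}
    for e in range(2 ** 7):  # channel error pattern added to the all-zero code vector
        nearest = min(codebook, key=lambda c: weight(c ^ e))
        probs[codebook[nearest]] += p ** weight(e) * (1 - p) ** (7 - weight(e))
    return probs


probs = message_error_probabilities(Fraction(1, 10))
```

The message patterns 1000 and 0001 both have weight one, but they are encoded as code vectors of weight 3 and 4 respectively, and their probabilities after decoding differ.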

An interesting approach to coding for numerical data transmission 
is found in unequal error-protection codes*. The idea behind unequal 
error-protection codes is to match the protection provided by the code 
to the numerical significance of the transmitted bits. Significant-bit 
codes (a subclass of unequal error-protection codes) have been shown 
to be effective in reducing the average numerical error and in many 
cases are easy to implement.*’® However, the transmission system 
model in Ref. 2 excludes unequal error-protection codes which is un- 
fortunate because these codes are directly applicable to the basic 
problem considered in Ref. 2, that is, reducing the average numerical 
error. 

Accordingly, it is important to examine the performance of Harper
codes when a less restrictive and more practical transmission system
model is used. For our model, we assume simply that the transmission
system is binary and that the errors are independent of the transmitted
bits. A binary transmission system satisfies this model if, for every
integer $r$, $0 \le r \le 2^k - 1$, and integer $s$, $0 \le s \le 2^k - 1$, there exists
an integer $t$, $0 \le t \le 2^k - 1$, such that

$$\Pr[H_k(r) \mid H_k(s)] = \Pr[H_k(t) \mid B_k(0)] \tag{2}$$

where

$$H_k(t) = H_k(r) \oplus H_k(s) \tag{3}$$

and $B_k(j)$ denotes the $k$-bit natural binary representation of the integer
$j$, $0 \le j \le 2^k - 1$. Observe that equation (2) implies that the probability
of a particular error pattern $H_k(t)$ in a Harper code sequence
is independent of the transmitted sequence.
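
Conditions (2) and (3) say that the transition probability depends only on the modulo 2 sum of the received and transmitted sequences. The sketch below tests this property for a transition function `pr(received, sent)` defined on $k$-bit integers; the function names and the two example channels are assumptions made for illustration. Since a Harper code uses every $k$-tuple exactly once, testing over all pairs of sequences is equivalent to testing over all pairs $(r, s)$.

```python
from fractions import Fraction


def satisfies_model(pr, k):
    """True if pr(received, sent) depends only on received XOR sent, as (2) and (3) require."""
    n = 2 ** k
    return all(pr(r, s) == pr(r ^ s, 0) for r in range(n) for s in range(n))


def bsc(p, k):
    """Independent, symmetric bit errors with probability p."""
    def pr(r, s):
        w = bin(r ^ s).count("1")
        return p ** w * (1 - p) ** (k - w)
    return pr


def z_channel(p, k):
    """Asymmetric errors: a transmitted 1 may be received as 0; a 0 is never corrupted."""
    def pr(r, s):
        prob = Fraction(1)
        for b in range(k):
            sent, recv = (s >> b) & 1, (r >> b) & 1
            if sent == 0:
                prob *= 1 if recv == 0 else 0
            else:
                prob *= p if recv == 0 else 1 - p
        return prob
    return pr
```

The asymmetric channel fails the test because its errors depend on the transmitted bits, which is exactly what the model rules out.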

Because the transmission system model is not very restrictive, the
results to be presented are applicable to a wide range of practical
systems. For example, the model is satisfied by the important class
of binary transmission systems composed of

(i) a linear block code with a decoding scheme equivalent to Slepian’s
standard array’, and

(ii) a binary symmetric channel in which the errors are independent
of the transmitted bits.


III. THE AVERAGE NUMERICAL ERROR FOR A HARPER CODE 


Let $H'$ be a Harper code in which $s$ is encoded as $H'_k(s)$. From the
definition of a Harper code, it is possible that $H'_k(0) \ne B_k(0)$. We first
show that if $H'_k(0) \ne B_k(0)$, then a Harper code $H$ [in which $s$ is encoded
as $H_k(s)$] can be constructed such that (i) $H_k(0) = B_k(0)$ and (ii) the
performance of $H'$ is identical to the performance of $H$. The average
numerical error for $H'$ is

$$\mathrm{ANE}' = \frac{B}{2^k} \sum_{r=0}^{2^k-1} \sum_{s=0}^{2^k-1} |r - s| \Pr[H'_k(r) \mid H'_k(s)]. \tag{4}$$

Let $H$ be a code whose elements are obtained from the elements of $H'$
by the relation

$$H_k(s) = H'_k(s) \oplus H'_k(0). \tag{5}$$

From (5), $H_k(0) = B_k(0)$.

We now show that $H$ is a Harper code. Clearly $H_k(0)$ satisfies the
requirements for a Harper code. Suppose that $H_k(0)$ through $H_k(l - 1)$
have been determined by (5). Now, if $H'_k(s)$ is distance $d$ from $H'_k(l)$,
then $H_k(s)$ is distance $d$ from $H_k(l)$. Thus, if $H'_k(l)$ is assigned to have
the most numbered one-distant neighbors, $H_k(l)$ will have the most
numbered one-distant neighbors. It follows that $H$ is a Harper code.

The average numerical error for $H$ is given by equation (1). We
must show that the expression for ANE is identical to the expression
for ANE$'$. From (2),

$$\Pr[H'_k(r) \mid H'_k(s)] = \Pr[H'_k(r) \oplus H'_k(s) \mid B_k(0)].$$

Also, from (2),

$$\Pr[H_k(r) \mid H_k(s)] = \Pr[H_k(r) \oplus H_k(s) \mid B_k(0)].$$

From (5),

$$H_k(r) \oplus H_k(s) = H'_k(r) \oplus H'_k(s).$$

Therefore,

$$\Pr[H_k(r) \mid H_k(s)] = \Pr[H'_k(r) \mid H'_k(s)]$$

and, by (1) and (4),

$$\mathrm{ANE} = \mathrm{ANE}'.$$
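
The argument can be traced on a concrete code. In the sketch below, the folded binary code for $k = 4$ plays the role of $H'$ (its sequence for $s = 0$ is 0111), and relation (5) produces $H$. The error statistics `pattern_prob` are arbitrary numbers generated here purely for illustration; they stand for the probabilities of the 16 error patterns given that $B_k(0)$ is sent.

```python
from fractions import Fraction
import random


def ane(code, pattern_prob, B=1):
    """Equation (1) with Pr[received | sent] = pattern_prob[received XOR sent]."""
    n = len(code)
    total = sum(abs(r - s) * pattern_prob[code[r] ^ code[s]]
                for r in range(n) for s in range(n))
    return B * total / n


k = 4
half = 2 ** (k - 1)
# Folded binary: the top bit is kept, and the lower bits are complemented when it is 0.
folded = [s if s >= half else half - 1 - s for s in range(2 ** k)]
translated = [h ^ folded[0] for h in folded]  # relation (5)

rng = random.Random(0)
raw = [rng.randint(1, 100) for _ in range(2 ** k)]
pattern_prob = [Fraction(v, sum(raw)) for v in raw]  # arbitrary error statistics
```

Because the modulo 2 sum of any two sequences is unchanged by the translation, both codes give the same value of `ane` for any choice of `pattern_prob`.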


Thus, every Harper code is equivalent in performance to a Harper 
code in which 


$$H_k(0) = B_k(0). \tag{6}$$


For convenience and without loss of generality, we shall consider the 
performance of Harper codes that satisfy (6). At the end of Section IV, 
we remove this restriction and give, in general terms, the structure 
of all Harper codes that are equivalent to the natural binary rep- 
resentation. 

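Now, let us consider the expression for the average numerical error
for $H$. By substituting (2) into (1) and rewriting,

$$\mathrm{ANE} = \frac{B}{2^k} \sum_{t=0}^{2^k-1} \sum_{s=0}^{2^k-1} |r_s - s| \Pr[H_k(t) \mid B_k(0)] \tag{7}$$

where $r_s$ is the value of $r$ in (3), that is,

$$H_k(r_s) = H_k(s) \oplus H_k(t). \tag{8}$$

Now, (7) can be written as

$$\mathrm{ANE} = \frac{B}{2^k} \sum_{t=1}^{2^k-1} C_t \Pr[H_k(t) \mid B_k(0)] \tag{9}$$

where

$$C_t = \sum_{s=0}^{2^k-1} |r_s - s|. \tag{10}$$

The term for $t = 0$ contributes nothing, since then $r_s = s$. The expression
for the average numerical error is determined by specifying
each $C_t$ ($1 \le t \le 2^k - 1$).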
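
For small $k$ the coefficients can be obtained by direct enumeration of (8) and (10). The sketch below does this for the $k = 4$ code of Table I; the function names are introduced here, and `pattern_prob[e]` is assumed to hold the probability of error pattern `e` given that $B_k(0)$ is sent.

```python
def coefficients(code):
    """Return [C_0, C_1, ...] from equations (8) and (10) for a code with code[0] == 0."""
    index = {h: s for s, h in enumerate(code)}
    coeffs = []
    for t in range(len(code)):
        # r_s is the number whose sequence is H(s) XOR H(t), equation (8).
        coeffs.append(sum(abs(index[code[s] ^ code[t]] - s) for s in range(len(code))))
    return coeffs


def ane_from_coefficients(code, pattern_prob, B=1):
    """Equation (9) for the given error-pattern probabilities."""
    n = len(code)
    c = coefficients(code)
    return B * sum(c[t] * pattern_prob[code[t]] for t in range(1, n)) / n


TABLE_I = [0b0000, 0b0010, 0b0001, 0b0011, 0b0111, 0b0110, 0b0101, 0b0100,
           0b1000, 0b1001, 0b1011, 0b1010, 0b1100, 0b1110, 0b1101, 0b1111]
```

For Table I the enumeration returns the weights 24, 24, 32, 64 (four times), and 128 (eight times), which are the coefficients that appear in the final expression of Appendix D.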

In order to evaluate $C_t$, we proceed as follows. Divide the $2^k$ elements
of $H$ into $k + 1$ sets called levels. The 0-level contains $H_k(0)$ exclusively.
For $1 \le j \le k$, the $j$-level is the set of $H_k(s)$ for which $2^{j-1} \le s \le 2^j - 1$.
Because $H$ is a Harper code, the elements of level $j$ are in the shadow
of the $(j - 1)$-subcube† formed by the elements of levels 0 through
$j - 1$. From equation (6) and the definition of a Harper code, it follows
that each element of the $j$-level has a one in a particular position which
we call the $j$-position. Thus, the $j$-level consists of the $k$-tuples that
have zeros in positions $j + 1$ through $k$, a one in position $j$, and all
possible $(j - 1)$-tuples in positions 1 through $j - 1$.

† A $(j - 1)$-subcube of the $k$-cube is a set of all $k$-tuples that are the same in
$k - j + 1$ positions. The shadow of a $(j - 1)$-subcube is obtained by changing
one of the $k - j + 1$ fixed positions.

Notice that the position numbers are determined by the structure
of the Harper code and not by the order in which the bits are arranged
for transmission. For example, in the Harper code shown in Table I,
$\Pr[H_4(2) \mid B_4(0)]$ is the probability that no transmission errors occur in
positions 1, 3, and 4 and that a transmission error occurs in position 2.
If transmitted in the order shown in Table I, $\Pr[H_4(2) \mid B_4(0)]$ is the
probability that the error sequence 0001 occurs.

We must determine $C_t$ for each of the $2^k - 1$ nonzero values of $t$.
Thus, we regard $t$ as known and seek to determine $C_t$. Let $\sigma$ be an
integer such that

$$2^{\sigma-1} \le t \le 2^{\sigma} - 1. \tag{11}$$

Because $H$ satisfies equation (6), $H_k(t)$ has a one in position $\sigma$. To
evaluate $C_t$, we rewrite (10) to exhibit the levels of $s$ as

$$C_t = \left( r_0 + \sum_{j=1}^{\sigma} \sum_{s=2^{j-1}}^{2^j-1} |r_s - s| \right) + \sum_{j=\sigma+1}^{k} \sum_{s=2^{j-1}}^{2^j-1} |r_s - s| \tag{12}$$

TABLE I: A $k = 4$ HARPER CODE

  s    H_4(s)    Level number
  0    0000      0
  1    0010      1
  2    0001      2
  3    0011      2
  4    0111      3
  5    0110      3
  6    0101      3
  7    0100      3
  8    1000      4
  9    1001      4
 10    1011      4
 11    1010      4
 12    1100      4
 13    1110      4
 14    1101      4
 15    1111      4

Bit columns of H_4(s), left to right: position 4, position 3, position 1, position 2.


3120 THE BELL SYSTEM TECHNICAL JOURNAL, NOVEMBER 1969 


where the 0-level is shown individually as $r_0$ and $j$ indexes the levels
from 1 to $k$. The parentheses enclose the contribution of levels 0 through
$\sigma$. From Appendix A,

$$\left( r_0 + \sum_{j=1}^{\sigma} \sum_{s=2^{j-1}}^{2^j-1} |r_s - s| \right) = 2^{2\sigma-1}. \tag{13}$$
Now, consider the set of $H_k(s)$ in the $j$-level where $\sigma + 1 \le j \le k$
and $2^{j-1} \le s \le 2^j - 1$. First, we define a run as follows.† In the $j$-level,
there is a run in position $m$, $1 \le m \le j - 1$, that starts at $s_0$ and is
of length $R(m, s_0)$ if and only if

(i) $R(m, s_0) = 2^l$ for some integer $l \ge 0$,

(ii) the set of $H_k(s)$ for $s_0 \le s \le s_0 + 2^l - 1$ forms an $l$-subcube
of the $k$-cube where $m$ is one of the $k - l$ fixed positions,

(iii) the set of $H_k(s)$ for $s_0 + 2^l \le s \le s_0 + 2^{l+1} - 1$ forms an
$l$-subcube that is in the shadow of the subcube in (ii),

(iv) the subcube in (iii) is distinguished from the subcube in (ii)
by position $m$, and

(v) the $H_k(s)$ for $2^{j-1} \le s \le s_0 - 1$ can be divided into runs of
length $2^l$ although perhaps not in position $m$.

An example from Table I will illustrate the definition of a run.
Consider the 4-level. Then $H_4(8)$ starts a run of length 1 in position 2,
a run of length 2 in position 1, and a run of length 4 in position 3. Thus,

$$R(1, 8) = 2, \qquad R(2, 8) = 1, \qquad R(3, 8) = 4.$$


Let $w[H_k(t)] = w$ and let $t_1, t_2, \ldots, t_w$ denote the $w$ nonzero
positions in $H_k(t)$. Then $R(t_m, 2^{j-1})$ is the length of the run in position
$t_m$ that starts at $2^{j-1}$ (that is, the length of the first run in position
$t_m$ in the $j$-level). Let

$$\gamma_{j,1}(t) = \max_m R(t_m, 2^{j-1}).$$

From Appendix C,

$$\sum_{s=2^{j-1}}^{2^{j-1}+2\gamma_{j,1}(t)-1} |r_s - s| = \sum_{s=2^{j-1}}^{2^{j-1}+\gamma_{j,1}(t)-1} (r_s - s) + \sum_{s=2^{j-1}+\gamma_{j,1}(t)}^{2^{j-1}+2\gamma_{j,1}(t)-1} (s - r_s) = 2\gamma_{j,1}^2(t).$$

† Appendix B contains a more complete discussion of the structure of the j-level
of a Harper code and the relationship between the structure and the concept of a 
run. It is shown that runs are basic to the structure of Harper codes and that the 
definition of a run is meaningful and consistent. 




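The above process can be extended to obtain $\gamma_{j,i}(t)$ after $\gamma_{j,1}(t)$,
$\gamma_{j,2}(t), \ldots, \gamma_{j,i-1}(t)$ are known. Specifically,

$$\gamma_{j,i}(t) = \max_m R\!\left(t_m,\ 2^{j-1} + 2\sum_{l=1}^{i-1} \gamma_{j,l}(t)\right).$$

Then

$$\sum_{s=2^{j-1}+2\sum_{l=1}^{i-1}\gamma_{j,l}(t)}^{2^{j-1}-1+2\sum_{l=1}^{i}\gamma_{j,l}(t)} |r_s - s| = 2\gamma_{j,i}^2(t).$$

By continuing the process, we eventually exhaust the $2^{j-1}$ values
of $s$ in the $j$-level. Let $g_j$ denote the number of $\gamma_{j,i}(t)$ needed to cover
the $j$-level, that is,

$$2\sum_{i=1}^{g_j} \gamma_{j,i}(t) = 2^{j-1}.$$

It follows that

$$\sum_{s=2^{j-1}}^{2^j-1} |r_s - s| = 2\sum_{i=1}^{g_j} \gamma_{j,i}^2(t). \tag{14}$$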

From (12), (13), and (14),

$$C_t = 2^{2\sigma-1} + 2\sum_{j=\sigma+1}^{k} \sum_{i=1}^{g_j} \gamma_{j,i}^2(t). \tag{15}$$

By substituting (15) into (9),

$$\mathrm{ANE} = \frac{B}{2^k} \sum_{t=1}^{2^k-1} \left[ 2^{2\sigma-1} + 2\sum_{j=\sigma+1}^{k} \sum_{i=1}^{g_j} \gamma_{j,i}^2(t) \right] \Pr[H_k(t) \mid B_k(0)] \tag{16}$$

where $\sigma$ is given by (11). The expression in (16) is particularly useful
because it consists exclusively of error probabilities conditional upon
$B_k(0)$ being transmitted and the $\gamma_{j,i}(t)$ can be obtained directly from
the Harper code. A numerical example that illustrates the use of (15)
and (16) is given in Appendix D.
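
Equation (15) can be exercised on the code of Table I. The sketch below does not evaluate the run lengths $R(m, s_0)$ from their definition. Instead it assumes the pairing property stated in Appendix C: within a level, each $\gamma_{j,i}(t)$ is taken as the smallest power of two $g$ for which the next $g$ values of $s$ are mapped by (8) onto the $g$ values that follow them. The function names are introduced here for illustration.

```python
def gammas(code, t, j):
    """List gamma_{j,1}(t), gamma_{j,2}(t), ... for level j (assumes code[0] == 0)."""
    index = {h: s for s, h in enumerate(code)}

    def r(s):
        return index[code[s] ^ code[t]]  # r_s from equation (8)

    start, end = 2 ** (j - 1), 2 ** j
    found = []
    while start < end:
        g = 1
        # Smallest power of two g with start..start+g-1 mapped onto start+g..start+2g-1.
        while sorted(r(s) for s in range(start, start + g)) != list(range(start + g, start + 2 * g)):
            g *= 2
            if start + 2 * g > end:
                raise ValueError("no covering found; is j above the level of t?")
        found.append(g)
        start += 2 * g
    return found


def coefficient(code, t, k):
    """Equation (15), with sigma taken from (11)."""
    sigma = t.bit_length()
    return 2 ** (2 * sigma - 1) + 2 * sum(
        g * g for j in range(sigma + 1, k + 1) for g in gammas(code, t, j))


TABLE_I = [0b0000, 0b0010, 0b0001, 0b0011, 0b0111, 0b0110, 0b0101, 0b0100,
           0b1000, 0b1001, 0b1011, 0b1010, 0b1100, 0b1110, 0b1101, 0b1111]
```

With this procedure, `gammas(TABLE_I, 2, 4)` gives the three values 1, 1, 2 used for $t = 2$ in Appendix D, and `coefficient` reproduces $C_2 = 24$ and $C_3 = 32$.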

We now consider the condition under which two Harper codes give
identical performance. Let $H'$ be a Harper code that is not $H$ (that is,
$H'$ cannot be obtained from $H$ by a relationship of the form
$H'_k(s) = H_k(s) \oplus B_k(s_1)$ where $s_1$ is an arbitrary integer, $0 \le s_1 \le 2^k - 1$).
From (9), for $H'$,

$$\mathrm{ANE}' = \frac{B}{2^k} \sum_{t'=1}^{2^k-1} C'_{t'} \Pr[H'_k(t') \mid B_k(0)].$$

Then $H$ and $H'$ exhibit identical performance for any transmission
system that satisfies our model only if, for every $t$, $C'_{t'} = C_t$ where
$t'$ is determined by $H'_k(t') = H_k(t)$. Conversely, if $C'_{t'} \ne C_t$ for at least
one value of $t$, the two codes may or may not give the same performance,
depending upon the error statistics of the transmission system.


IV. CODES EQUIVALENT TO THE NATURAL BINARY REPRESENTATION 


Because of the considerable structure in the natural binary representation,
it is easy to use (15) to compute each $C_t$, $1 \le t \le 2^k - 1$.
For a given $t$, we first find $\sigma$ by (11), that is, $2^{\sigma-1}$ is the largest power
of 2 that does not exceed $t$. Then, for each $j$, $\sigma + 1 \le j \le k$, we determine $g_j$ and the
$\gamma_{j,i}(t)$. For the natural binary representation,

$$\gamma_{j,i}(t) = 2^{\sigma-1} \tag{17}$$

for $1 \le i \le g_j$, so $g_j = 2^{j-\sigma-1}$. Therefore, by (15) and (17),

$$C_t = 2^{2\sigma-1} + 2\sum_{j=\sigma+1}^{k} \sum_{i=1}^{2^{j-\sigma-1}} 2^{2\sigma-2} = 2^{k+\sigma-1}. \tag{18}$$

Notice that each $C_t$, $2^{\sigma-1} \le t \le 2^{\sigma} - 1$, is equal to $2^{k+\sigma-1}$. Thus,
$C_t$ is determined simply by the largest power of 2 in $t$. Substituting
(18) into (9) and rewriting, we obtain

$$\mathrm{ANE}_N = B \sum_{\sigma=1}^{k} 2^{\sigma-1} \sum_{t=2^{\sigma-1}}^{2^{\sigma}-1} \Pr[B_k(t) \mid B_k(0)] \tag{19}$$

where $\mathrm{ANE}_N$ denotes the average numerical error for the natural
binary representation.
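
The closed form (18) is easy to confirm by enumeration, since for the natural binary representation (8) reduces to a bitwise modulo 2 sum of the integers $s$ and $t$. The following sketch compares (10) with (18); the function names are chosen here for illustration.

```python
def natural_binary_coefficient(t, k):
    """C_t from equation (10) for the natural binary representation, where r_s = s XOR t."""
    return sum(abs((s ^ t) - s) for s in range(2 ** k))


def closed_form(t, k):
    """Equation (18): 2**(k + sigma - 1), with sigma = t.bit_length() as in (11)."""
    return 2 ** (k + t.bit_length() - 1)


# Compare the two expressions for every nonzero t and several word lengths k.
agree = all(natural_binary_coefficient(t, k) == closed_form(t, k)
            for k in range(1, 8) for t in range(1, 2 ** k))
```

Python's `int.bit_length()` returns exactly the $\sigma$ of (11) for a positive integer $t$.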

Is it possible to find a Harper code $H$ that is not the natural binary
representation but that exhibits performance that is identical to the
natural binary representation for all transmission systems that satisfy
our model? The answer is yes. We now show that a necessary and
sufficient condition is that

$$\gamma_{j,i}(t) = 2^{\sigma-1} \tag{20a}$$

for $1 \le i \le g_j$ and

$$g_j = 2^{j-\sigma-1} \tag{20b}$$

for each $t$, $1 \le t \le 2^k - 1$, and for each $j$, $\sigma + 1 \le j \le k$ (where $\sigma$ is
chosen so that $2^{\sigma-1} \le t \le 2^{\sigma} - 1$).

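If (20) is satisfied, then by (15), $C_t = 2^{k+\sigma-1}$. The average numerical
error for $H$ (denoted by $\mathrm{ANE}_H$) is

$$\mathrm{ANE}_H = B \sum_{\sigma=1}^{k} 2^{\sigma-1} \sum_{t=2^{\sigma-1}}^{2^{\sigma}-1} \Pr[H_k(t) \mid B_k(0)]. \tag{21}$$

By the definition of a Harper code and the definition of a level,

$$\sum_{t=2^{\sigma-1}}^{2^{\sigma}-1} \Pr[H_k(t) \mid B_k(0)] = \sum_{t=2^{\sigma-1}}^{2^{\sigma}-1} \Pr[B_k(t) \mid B_k(0)]. \tag{22}$$

Therefore, by (19), (21), and (22), $\mathrm{ANE}_H = \mathrm{ANE}_N$. It follows that
(20) is a sufficient condition.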

We now show by contradiction that (20) is a necessary condition.
Consider the set of coefficients $C_{2^{\sigma-1}}$ for $1 \le \sigma \le k$. From (15),

$$C_{2^{\sigma-1}} = 2^{2\sigma-1} + 2\sum_{j=\sigma+1}^{k} \sum_{i=1}^{g_j} \gamma_{j,i}^2(2^{\sigma-1}).$$

The term $2^{2\sigma-1}$ is independent of the particular Harper code used.
Thus, we need only consider the summation part. Suppose that it is
possible to arrange the $\gamma_{j,i}(2^{\sigma-1})$ so that they are not all equal to $2^{\sigma-1}$
but keep $C_{2^{\sigma-1}} = 2^{k+\sigma-1}$. If this is done, at least one $\gamma_{j,i}(2^{\sigma-1})$ will
be less than $2^{\sigma-1}$ and at least one $\gamma_{j,i}(2^{\sigma-1})$ will be greater than $2^{\sigma-1}$.
However, in order for one $\gamma_{j,i}(2^{\sigma-1})$ to be less than $2^{\sigma-1}$, there must
exist a $\sigma' < \sigma$ such that $\gamma_{j',i'}(2^{\sigma'-1}) > 2^{\sigma'-1}$. But in order for
$C_{2^{\sigma'-1}} = 2^{k+\sigma'-1}$, there must be at least one $\gamma_{j',i'}(2^{\sigma'-1}) < 2^{\sigma'-1}$. The argument
continues until we reach $\gamma_{j'',i''}(2^0)$ where there must be at least one

$$\gamma_{j'',i''}(2^0) > 2^0. \tag{23}$$

However, in order for $C_{2^0} = 2^k$, (23) implies that there must be at
least one $\gamma_{j'',i''}(2^0) < 2^0$, which is impossible. It follows that (20)
must hold in order for a Harper code to be equivalent to the natural
binary representation.

We can show the existence of a great many Harper codes other than
the natural binary representation that satisfy (20) by presenting
explicitly the structure implied by (20). At this point, we no longer
assume that $H_k(0) = B_k(0)$ but state the structure for any Harper
code. List the $H_k(s)$ sequentially as $s$ runs from 0 to $2^k - 1$. For position
$i$, $1 \le i \le k$, divide the $s$ into $2^{k-i+1}$ consecutive intervals each
of length $2^{i-1}$. Let $j$ index the intervals where $0 \le j \le 2^{k-i+1} - 1$.

A Harper code is equivalent to the natural binary representation
if and only if, for every odd numbered interval ($j$ odd), the binary
digit in position $i$ is the complement of the binary digit in position $i$
in the immediately preceding even numbered interval ($j$ even). The
digit in position $i$ in the even numbered intervals is arbitrary.

The structure is presented graphically in Table II for $k = 5$. The
symbol $b_{i,j}$ denotes the binary digit in position $i$ in the $j$th interval.
For odd $j$, $b_{i,j} = b^*_{i,j-1}$ (where $b^*_{i,j-1} = 1 \oplus b_{i,j-1}$) and, thus, $b^*_{i,j-1}$
is shown in Table II for odd $j$. For all even $j$, $b_{i,j}$ can be assigned
arbitrarily for each $i$.
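
The interval condition lends itself to a direct test. The sketch below is one possible reading, with the following assumption made here: position $i$ is identified as the single bit that first becomes nonzero in $H_k(2^{i-1}) \oplus H_k(0)$, in line with the level structure of Section III. The function name and the example lists are introduced for illustration; `TABLE_I` is the code of Table I.

```python
def equivalent_to_natural_binary(code, k):
    """Apply the interval test to a Harper code given as 2**k k-bit integers."""
    used = 0
    for i in range(1, k + 1):
        # Position i is taken to be the one new bit that level i introduces relative to H(0).
        new = (code[2 ** (i - 1)] ^ code[0]) & ~used
        if new == 0 or new & (new - 1):
            return False  # not exactly one new bit, so the level structure is violated
        used |= new
        length = 2 ** (i - 1)
        for start in range(0, 2 ** k, 2 * length):
            even = [bool(code[s] & new) for s in range(start, start + length)]
            odd = [bool(code[s] & new) for s in range(start + length, start + 2 * length)]
            # The digit must be constant in each interval and complemented in the odd one.
            if len(set(even)) != 1 or len(set(odd)) != 1 or even[0] == odd[0]:
                return False
    return True


TABLE_I = [0b0000, 0b0010, 0b0001, 0b0011, 0b0111, 0b0110, 0b0101, 0b0100,
           0b1000, 0b1001, 0b1011, 0b1010, 0b1100, 0b1110, 0b1101, 0b1111]
natural = list(range(16))
gray = [s ^ (s >> 1) for s in range(16)]
folded = [s if s >= 8 else 7 - s for s in range(16)]
```

The code of Table I fails the test in position 1 (the digits for $s = 4$ and $s = 5$ are equal), which agrees with its coefficients $C_1 = C_2 = 24$ differing from those of the natural binary representation.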

The expression for the average numerical error of the Harper codes
that are equivalent to the natural binary representation is interesting.
From (21), the set of error probabilities $\Pr[H_k(t) \mid B_k(0)]$ for
$2^{\sigma-1} \le t \le 2^{\sigma} - 1$ (that is, for $t$ in the $\sigma$-level) all have the weighting
coefficient $2^{\sigma-1}$. Thus, the cost of a particular error pattern is the numerical
significance of the most significant bit in error. When one considers
unequal error-protection codes, the structure in (21) is very convenient 
because the protection against transmission errors can be matched 
to the significance of the bit positions. However, for a Harper code 
that is not equivalent to the natural binary representation, the average 
numerical error does not exhibit the above structure. Therefore, un- 
equal error-protection codes appear to be less applicable. 


V. THE GRAY CODE AND THE FOLDED BINARY CODE


The Gray code and the folded binary code are of interest because 
of their possible applicability to numerical data transmission.’ This 
section shows that both of these codes exhibit performance that is 
identical to the performance of the natural binary representation for 
all binary transmission systems that satisfy our model. 

Let the $k$-bit binary representation of $s$ be $B_k(s) = (b_k, b_{k-1}, \ldots, b_1)$
where $b_i$, $1 \le i \le k$, is the binary digit in position $i$ and

$$s = \sum_{i=1}^{k} b_i 2^{i-1}.$$

As in Section III, the position numbers are defined in terms of the
structure of the code, not the order in which the bits are transmitted.
From Ref. 7, the Gray code representation of $s$, denoted by $G_k(s)$, is
$G_k(s) = (b_k, b_k \oplus b_{k-1}, \ldots, b_2 \oplus b_1)$. We show that the Gray code
is equivalent to the natural binary representation by showing that
the structure of the Gray code conforms with the structure in Table II.
Consider position $i$. As in the construction of Table II, divide the
range for $s$ into consecutive intervals each of length $2^{i-1}$ and number
the intervals sequentially from 0 to $2^{k-i+1} - 1$. The binary digit in
position $i$ of $G_k(s)$ in an even numbered interval is $b_{i+1} \oplus b_i$ and the
binary digit in position $i$ in the immediately following odd numbered
interval is $b_{i+1} \oplus b_i^* = (b_{i+1} \oplus b_i)^*$. Therefore, from the structure
in Table II, the Gray code is equivalent to the natural binary representation.

It is also interesting to consider the folded binary code®. Let $F_k(s)$
denote the representation of $s$. Then
$F_k(s) = (b_k, b_k^* \oplus b_{k-1}, \ldots, b_k^* \oplus b_1)$ where $b_k^* = b_k \oplus 1$.
As in the case of the Gray code, consider
position $i$ and divide the range for $s$ into intervals of length $2^{i-1}$. The
binary digit in position $i$ of $F_k(s)$ in an even numbered interval is
$b_k^* \oplus b_i$. The binary digit in position $i$ in the immediately following
odd numbered interval is $b_k^* \oplus b_i^* = (b_k^* \oplus b_i)^*$. Therefore, from the
structure in Table II, the folded binary code is equivalent to the natural
binary representation.


TABLE II: STRUCTURE FOR A HARPER CODE EQUIVALENT TO THE
NATURAL BINARY REPRESENTATION; $k = 5$

H_5(s) by position number (a blank entry repeats the digit above it)

  s   Pos. 5      Pos. 4      Pos. 3      Pos. 2      Pos. 1
  0   b_{5,0}     b_{4,0}     b_{3,0}     b_{2,0}     b_{1,0}
  1                                                   b*_{1,0}
  2                                       b*_{2,0}    b_{1,2}
  3                                                   b*_{1,2}
  4                           b*_{3,0}    b_{2,2}     b_{1,4}
  5                                                   b*_{1,4}
  6                                       b*_{2,2}    b_{1,6}
  7                                                   b*_{1,6}
  8               b*_{4,0}    b_{3,2}     b_{2,4}     b_{1,8}
  9                                                   b*_{1,8}
 10                                       b*_{2,4}    b_{1,10}
 11                                                   b*_{1,10}
 12                           b*_{3,2}    b_{2,6}     b_{1,12}
 13                                                   b*_{1,12}
 14                                       b*_{2,6}    b_{1,14}
 15                                                   b*_{1,14}
 16   b*_{5,0}    b_{4,2}     b_{3,4}     b_{2,8}     b_{1,16}
 17                                                   b*_{1,16}
 18                                       b*_{2,8}    b_{1,18}
 19                                                   b*_{1,18}
 20                           b*_{3,4}    b_{2,10}    b_{1,20}
 21                                                   b*_{1,20}
 22                                       b*_{2,10}   b_{1,22}
 23                                                   b*_{1,22}
 24               b*_{4,2}    b_{3,6}     b_{2,12}    b_{1,24}
 25                                                   b*_{1,24}
 26                                       b*_{2,12}   b_{1,26}
 27                                                   b*_{1,26}
 28                           b*_{3,6}    b_{2,14}    b_{1,28}
 29                                                   b*_{1,28}
 30                                       b*_{2,14}   b_{1,30}
 31                                                   b*_{1,30}
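
The two constructions can also be compared with the natural binary representation by enumeration. The sketch below builds $G_k(s)$ and $F_k(s)$ from the digit formulas of this section and, for every nonzero error pattern, sums $|r_s - s|$ over $s$. It assumes that position $i$ is stored as bit $i - 1$ of an integer for all three codes; the helper names are introduced here for illustration.

```python
def bits(s, k):
    """Return [b_1, ..., b_k] with s = sum of b_i * 2**(i - 1)."""
    return [(s >> (i - 1)) & 1 for i in range(1, k + 1)]


def pack(digits):
    """Inverse of bits(): digits[i - 1] is the digit in position i."""
    return sum(d << i for i, d in enumerate(digits))


def gray(s, k):
    b = bits(s, k) + [0]  # b_{k+1} is taken as 0
    return pack([b[i] ^ b[i + 1] for i in range(k)])


def folded(s, k):
    b = bits(s, k)
    comp = b[k - 1] ^ 1  # complement of b_k
    return pack([comp ^ b[i] for i in range(k - 1)] + [b[k - 1]])


def pattern_costs(code):
    """For each nonzero error pattern e, the sum over s of |r_s - s| with H(r_s) = H(s) XOR e."""
    index = {h: s for s, h in enumerate(code)}
    return {e: sum(abs(index[h ^ e] - s) for s, h in enumerate(code))
            for e in range(1, len(code))}


k = 5
natural_costs = pattern_costs(list(range(2 ** k)))
gray_costs = pattern_costs([gray(s, k) for s in range(2 ** k)])
folded_costs = pattern_costs([folded(s, k) for s in range(2 ** k)])
```

For $k = 5$ the three dictionaries are identical, and each cost equals $2^{k+\sigma-1}$ where $\sigma$ is the position of the most significant bit in error.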


VI. CONCLUSIONS 


The model used in this paper for the binary transmission system 
is quite general and is satisfied by a wide range of practical systems 
including many that utilize error-correcting codes. A technique is 
presented for determining the average numerical error for any Harper 
code. All Harper codes do not exhibit equal performance for all trans- 
mission systems that satisfy the model. Because the performance of 
a given Harper code is closely related to the error statistics of the 
transmission system, it does not appear possible to specify a Harper 
code that is best for all applications. However, a subset of Harper 
codes is defined such that all codes in the subset give identical per- 
formance for all transmission systems covered by the model. The 
subset is important because it includes the natural binary represen- 
tation, the Gray code, and the folded binary code. Unequal error- 
protection codes appear to be particularly applicable to Harper codes 
in the subset. 


APPENDIX A 


Contribution of Levels 0 through $\sigma$ to $C_t$

To determine the contribution of levels 0 through $\sigma$ to $C_t$, we must
evaluate

$$r_0 + \sum_{j=1}^{\sigma} \sum_{s=2^{j-1}}^{2^j-1} |r_s - s| = \sum_{s=0}^{2^{\sigma}-1} |r_s - s|.$$

From equation (8), for every $s$ in the range $0 \le s \le 2^{\sigma} - 1$, there
exists a unique $r_s$ in the range $0 \le r_s \le 2^{\sigma} - 1$. As $s$ runs from 0
through $2^{\sigma-1} - 1$, every $r_s$ in the range $2^{\sigma-1} \le r_s \le 2^{\sigma} - 1$ occurs
once and only once. Similarly, as $s$ runs from $2^{\sigma-1}$ through $2^{\sigma} - 1$,
every $r_s$ in the range $0 \le r_s \le 2^{\sigma-1} - 1$ occurs once and only once.
Accordingly,

$$\sum_{s=0}^{2^{\sigma}-1} |r_s - s| = \sum_{s=0}^{2^{\sigma-1}-1} (r_s - s) + \sum_{s=2^{\sigma-1}}^{2^{\sigma}-1} (s - r_s) = 2^{2\sigma-1}.$$


APPENDIX B 


The Structure of the j-Level of a Harper Code 


Consider the set of $H_k(s)$ in the $j$-level of a Harper code where
$2^{j-1} \le s \le 2^j - 1$. For clarity, Table III illustrates the ideas presented
here by applying them to the 4-level of the Harper code
in Table I.

Let $p$ be an integer, $1 \le p \le j - 1$. For each value of $p$, the $j$-level
can be divided into $2^{j-p}$ sets of consecutive values of $s$, each set of
length $2^{p-1}$. The sets are numbered consecutively from 0 through
$2^{j-p} - 1$ as follows. Let $\xi$ be an integer, $0 \le \xi \le 2^{j-p-1} - 1$. For each
value of $\xi$, there will be two sets: an even numbered set whose number
is of the form $2\xi$ and an odd numbered set whose number is of the form
$2\xi + 1$.

An even numbered set contains the $H_k(s)$ for
$2^{j-1} + 2\xi 2^{p-1} \le s \le 2^{j-1} + (2\xi + 1)2^{p-1} - 1$ and forms a $(p - 1)$-subcube because $H$
is a Harper code. Similarly, an odd numbered set contains the $H_k(s)$
for $2^{j-1} + (2\xi + 1)2^{p-1} \le s \le 2^{j-1} + (2\xi + 2)2^{p-1} - 1$ and forms
a $(p - 1)$-subcube. The important point is that for each value of $\xi$,
a useful relationship exists between set $2\xi$ and set $2\xi + 1$. Specifically,
the $(p - 1)$-subcube formed by set $2\xi + 1$ is in the shadow of the
$(p - 1)$-subcube formed by set $2\xi$. Accordingly, there is exactly one fixed
position in which every $H_k(s)$ in set $2\xi + 1$ differs from every $H_k(s)$ in set $2\xi$.
Denote the position that distinguishes the subcubes by $m$. Therefore, the $2\xi$ set
consists of $2^{p-1}$ elements each of which has the same binary digit in
position $m$. Similarly, the $2\xi + 1$ set consists of $2^{p-1}$ elements each
of which has in position $m$ the complement of the binary digit in
position $m$ in the elements of set $2\xi$.

The above sets form what we call a run in position $m$ of length $2^{p-1}$
that starts at $2^{j-1} + 2\xi 2^{p-1}$ (the first $H_k(s)$ in set $2\xi$). The definition
in Section III follows from the preceding sentence.


TABLE III: DETAILS OF 4-LEVEL OF HARPER CODE IN TABLE I

                 p = 1         p = 2         p = 3
  s    H_4(s)    Set   ξ       Set   ξ       Set   ξ
  8    1000      0     0       0     0       0     0
  9    1001      1     0       0     0       0     0
 10    1011      2     1       1     0       0     0
 11    1010      3     1       1     0       0     0
 12    1100      4     2       2     1       1     0
 13    1110      5     2       2     1       1     0
 14    1101      6     3       3     1       1     0
 15    1111      7     3       3     1       1     0

Bit columns of H_4(s), left to right: position 4, position 3, position 1, position 2.


APPENDIX C 


Contribution of First $2\gamma_{j,1}(t)$ Values of $s$ in Level $j$ to $C_t$


From equation (8), as $s$ runs from $2^{j-1}$ through $2^{j-1} + \gamma_{j,1}(t) - 1$,
every $r_s$ in the range $2^{j-1} + \gamma_{j,1}(t) \le r_s \le 2^{j-1} + 2\gamma_{j,1}(t) - 1$ occurs
once and only once. Similarly, as $s$ runs from $2^{j-1} + \gamma_{j,1}(t)$ through
$2^{j-1} + 2\gamma_{j,1}(t) - 1$, every $r_s$ in the range $2^{j-1} \le r_s \le 2^{j-1} + \gamma_{j,1}(t) - 1$
occurs once and only once. Therefore,

$$\sum_{s=2^{j-1}}^{2^{j-1}+2\gamma_{j,1}(t)-1} |r_s - s| = \sum_{s=2^{j-1}}^{2^{j-1}+\gamma_{j,1}(t)-1} (r_s - s) + \sum_{s=2^{j-1}+\gamma_{j,1}(t)}^{2^{j-1}+2\gamma_{j,1}(t)-1} (s - r_s) = 2\gamma_{j,1}^2(t).$$


APPENDIX D 


Numerical Example to Illustrate Equations (15) and (16) 


Consider the Harper code given in Table I. We show how to use
equation (15) when $t = 2$ and $t = 3$ to find $C_2$ and $C_3$, respectively.
For $t = 2$, $\sigma = 2$ so, from (15),

$$C_2 = 8 + 2\sum_{j=3}^{4} \sum_{i=1}^{g_j} \gamma_{j,i}^2(2).$$

In the 3-level, $\gamma_{3,1}(2)$ and $\gamma_{3,2}(2)$ are shown in Table IV. Therefore,
$g_3 = 2$. Also, in the 4-level, $\gamma_{4,1}(2)$, $\gamma_{4,2}(2)$ and $\gamma_{4,3}(2)$ are given in
Table IV. Thus, $g_4 = 3$. It follows that

$$C_2 = 8 + 2\,(1^2 + 1^2 + 1^2 + 1^2 + 2^2) = 24.$$
TABLE IV: ILLUSTRATION OF EQUATION (15) APPLIED TO THE
HARPER CODE IN TABLE I

  s    Level    H_4(s)    γ_{j,i}(2)           γ_{j,i}(3)
  0    0        0000
  1    1        0010
  2    2        0001
  3    2        0011
  4    3        0111      γ_{3,1}(2) = 1       γ_{3,1}(3) = 2
  5    3        0110
  6    3        0101      γ_{3,2}(2) = 1
  7    3        0100
  8    4        1000      γ_{4,1}(2) = 1       γ_{4,1}(3) = 2
  9    4        1001
 10    4        1011      γ_{4,2}(2) = 1
 11    4        1010
 12    4        1100      γ_{4,3}(2) = 2       γ_{4,2}(3) = 2
 13    4        1110
 14    4        1101
 15    4        1111

Each γ_{j,i}(t) is entered on the first of the 2γ_{j,i}(t) consecutive rows that it covers.
Bit columns of H_4(s), left to right: position 4, position 3, position 1, position 2.

Similarly, for ¢ = 3, o = 20, from (15), 


$$C_3 = 8 + 2 \sum_{j=3}^{4} \sum_{i=1}^{g_j} y_{j,i}^{2}(3).$$

In Table IV, $y_{3,1}(3)$, $y_{4,1}(3)$ and $y_{4,2}(3)$ are given. Thus,

$$C_3 = 8 + 2(2^2 + 2^2 + 2^2) = 32.$$
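The two evaluations amount to the following arithmetic, sketched here in Python. The helper name `c_from_runs` is hypothetical. The values used for $t = 3$ are the three 2's that appear in the expression for $C_3$; the values used for $t = 2$ (four runs with $y = 1$ and one with $y = 2$) are an assumption, chosen as the set consistent with the coefficient 24 of Pr[2 | 0] in the ANE expression of this appendix.

```python
def c_from_runs(constant, run_half_lengths):
    """Evaluate constant + 2 * sum(y**2), the form equation (15) takes here."""
    return constant + 2 * sum(y * y for y in run_half_lengths)


# Assumed order: y_{3,1}(2), y_{3,2}(2), y_{4,1}(2), y_{4,2}(2), y_{4,3}(2)
y_for_t2 = [1, 1, 1, 1, 2]
# y_{3,1}(3), y_{4,1}(3), y_{4,2}(3)
y_for_t3 = [2, 2, 2]

C2 = c_from_runs(8, y_for_t2)
C3 = c_from_runs(8, y_for_t3)
print(C2, C3)  # prints: 24 32
```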


By similar reasoning, the remaining $C_t$ can be found. The expression
for the average numerical error of the Harper code in Table I is

$$
\begin{aligned}
\mathrm{ANE} = \frac{1}{16}\bigl(&24\Pr[1 \mid 0] + 24\Pr[2 \mid 0] + 32\Pr[3 \mid 0] + 64\Pr[4 \mid 0] \\
&+ 64\Pr[5 \mid 0] + 64\Pr[6 \mid 0] + 64\Pr[7 \mid 0] + 128\Pr[8 \mid 0] \\
&+ 128\Pr[9 \mid 0] + 128\Pr[10 \mid 0] + 128\Pr[11 \mid 0] + 128\Pr[12 \mid 0] \\
&+ 128\Pr[13 \mid 0] + 128\Pr[14 \mid 0] + 128\Pr[15 \mid 0]\bigr).
\end{aligned}
$$
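As a rough illustration of how this expression might be evaluated, the Python sketch below assumes a binary symmetric channel with independent bit errors of probability `p`, and assumes that Pr[t | 0] is the probability of the 4-bit error pattern whose binary representation is $t$. Both assumptions, the normalising factor of 1/16, and the function name `ane_bsc` are illustrative choices and not taken from the text above.

```python
# Coefficients C_t as they appear in the ANE expression.
C = {1: 24, 2: 24, 3: 32}
C.update({t: 64 for t in range(4, 8)})
C.update({t: 128 for t in range(8, 16)})


def ane_bsc(p, n_bits=4):
    """ANE under an assumed binary symmetric channel with bit-error rate p."""
    total = 0.0
    for t, c_t in C.items():
        weight = bin(t).count("1")  # number of bits in error for pattern t
        total += c_t * p ** weight * (1.0 - p) ** (n_bits - weight)
    return total / 2 ** n_bits      # assumed normalisation: 16 equally likely inputs


for p in (0.001, 0.01, 0.1):
    print(p, ane_bsc(p))
```

Under these assumptions the single-error patterns dominate for small `p`, so the result grows roughly in proportion to `p`.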
B.S.T.J. BRIEF


Solving Nonlinear Network Equations Using 
Optimization Techniques 


By ALLEN GERSHO 
(Manuscript received September 10, 1969) 


A class of nonlinear equations arising in transistor network analysis, 
as well as in other areas, has the form 


$$f_i(x_i) + \sum_{j=1}^{n} a_{ij} x_j - b_i = 0, \qquad i = 1, 2, \ldots, n \qquad (1)$$

or in matrix notation

$$F(x) + Ax - b = 0, \qquad (2)$$


where the nonlinearities $f_i(\cdot)$ are continuously differentiable, strictly
monotone increasing functions. Results by Willson¹ and Sandberg and
Willson²,³ on nonlinear networks have included broad conditions for the
existence and uniqueness of a solution to equation (2). However, con- 
vergent computational algorithms for finding the solution have been 
given only for restricted subclasses of the class of equations that have 
unique solutions.*'?'*"? These subclasses are characterized by a variety 
of restrictions on the matrix A and on the type of nonlinearities. In this 
brief we show that a single convergent algorithm exists for solving these 
equations under conditions virtually as broad as the known existence 
and uniqueness conditions. Peripherally, we obtain under these condi- 
tions a conceptually simple proof of the existence of a solution. 

The approach is to use the old technique (probably due to Cauchy) 
of converting a root-finding problem to a minimization problem. Let 


$$r(x) = F(x) + Ax - b, \qquad (3)$$

and define the scalar valued ‘potential’ function

$$Q(x) = r^{T} B r \qquad (4)$$


where B is an arbitrarily chosen symmetric positive definite matrix and 
T denotes the transpose. Then Q(x) is positive unless x is a solution 
of equation (2). Consequently, minimizing Q(x) is equivalent to solving 
equation (2) if in fact the nonlinear equation (2) has a solution. 
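A small numerical illustration of equations (3) and (4) follows. It is a sketch under assumed data: the two-variable system, the nonlinearity $f_i(u) = u + u^3$, the matrix `A`, the choice of `B` as the identity, and the point `X_STAR` are all hypothetical and are not taken from the references.

```python
A = [[2.0, -1.0], [1.0, 3.0]]   # hypothetical network matrix
B = [[1.0, 0.0], [0.0, 1.0]]    # any symmetric positive definite matrix would do
X_STAR = [1.0, -1.0]            # point chosen in advance to be the solution


def f(u):
    """Assumed nonlinearity: strictly increasing, maps the real line onto itself."""
    return u + u ** 3


def matvec(M, v):
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]


# Choose b so that X_STAR satisfies F(x) + Ax - b = 0.
b = [f(X_STAR[i]) + matvec(A, X_STAR)[i] for i in range(2)]


def residual(x):
    """r(x) = F(x) + Ax - b, equation (3)."""
    Ax = matvec(A, x)
    return [f(x[i]) + Ax[i] - b[i] for i in range(2)]


def potential(x):
    """Q(x) = r^T B r, equation (4)."""
    r = residual(x)
    Br = matvec(B, r)
    return sum(r[i] * Br[i] for i in range(2))


print(potential(X_STAR))       # prints: 0.0
print(potential([0.0, 0.0]))   # prints: 41.0
```

The potential vanishes at the solution and is positive elsewhere, which is what allows root finding to be recast as minimization.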
Since Q(x) is continuous, we may regard it as a continuous surface and
observe that if

$$Q(x) \to \infty \quad \text{as} \quad \|x\| \to \infty \qquad (5)$$

the so-called "level sets",


{x : Q(x) < c}, 


are bounded for each number c > 0 and there must exist a point x* where 
Q(x) attains a global minimum. Under what conditions will this mini- 
mum satisfy Q(x*) = 0 so that x* is a solution of equation (2)? From 
equations (3) and (4) the gradient of Q is easily found to be 


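$$\nabla Q(x) = 2(D_x + A^{T}) B r \qquad (6)$$


where $D_x$ is the positive diagonal matrix whose $i$th diagonal element is
$f_i'(x_i)$, where the prime denotes differentiation. Since the gradient must
be zero at a minimum and B is nonsingular, either (i)

$$r(x^*) = 0,$$

or (ii), using $\det\{D_x + A^{T}\} = \det\{D_x + A\}$,

$$\det\{D_x + A\} = 0 \quad \text{at } x = x^*.$$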
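The gradient formula (6) can be checked against finite differences. The sketch below does this for an assumed two-variable example; the nonlinearity $f_i(u) = u + u^3$, the matrices `A` and `B`, the vector `b`, and the test point are hypothetical choices made only for illustration.

```python
A = [[2.0, -1.0], [1.0, 3.0]]   # hypothetical network matrix
B = [[2.0, 0.5], [0.5, 1.0]]    # hypothetical symmetric positive definite matrix
b = [5.0, -4.0]


def matvec(M, v):
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]


def residual(x):
    """r(x) = F(x) + Ax - b with the assumed f_i(u) = u + u**3."""
    Ax = matvec(A, x)
    return [x[i] + x[i] ** 3 + Ax[i] - b[i] for i in range(2)]


def potential(x):
    r = residual(x)
    Br = matvec(B, r)
    return sum(r[i] * Br[i] for i in range(2))


def gradient(x):
    """Equation (6): grad Q = 2 (D_x + A^T) B r, with D_x = diag(f_i'(x_i))."""
    Br = matvec(B, residual(x))
    A_transpose = [[A[j][i] for j in range(2)] for i in range(2)]
    AtBr = matvec(A_transpose, Br)
    return [2.0 * ((1.0 + 3.0 * x[i] ** 2) * Br[i] + AtBr[i]) for i in range(2)]


def numeric_gradient(x, h=1e-6):
    """Central-difference approximation to grad Q, for comparison only."""
    grad = []
    for i in range(2):
        xp, xm = list(x), list(x)
        xp[i] += h
        xm[i] -= h
        grad.append((potential(xp) - potential(xm)) / (2.0 * h))
    return grad


point = [0.5, -0.3]
print(gradient(point))
print(numeric_gradient(point))
```

The two printed vectors should agree to several decimal places; a mismatch would indicate a sign or transpose error in the analytic expression.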
If A is in the class of matrices $P_0$ characterized by the property”

$$\det\{D + A\} \ne 0 \quad \text{for all diagonal matrices } D > 0, \qquad (7)$$

condition (ii) cannot hold, because $D_x$ is a positive diagonal matrix. It
follows that condition (i) holds, so that x* is a solution of equation (2)
for A in $P_0$ if condition (5) is satisfied. But Theorem 5 of Ref. 2 implies
that condition (5) is satisfied if A is in $P_0$ and the range of the
nonlinearities $f_i(\cdot)$ is the entire real line.* Uniqueness of the solution of
equation (2) is very simply shown in Ref. 2. Reference 3 shows that the
basic condition, A in $P_0$, is satisfied for large classes of transistor
networks.

The minimum of a continuously differentiable function with bounded 
level sets can always be found by a gradient descent algorithm when the 
gradient has a unique root.° No assumption regarding convexity or the 
behavior of the Hessian matrix is necessary. Clearly, a sufficiently 
small change in x in the negative gradient direction will always decrease 
the potential Q(x) unless x is already at a minimum. A sequence of itera- 
tions of this type, that is, 


* Recently Sandberg® has shown that condition (5) holds without any requirements
on the range of the nonlinearities if A is nonsingular as well as in $P_0$.
$$x_{k+1} = x_k - \gamma_k \nabla Q(x_k), \qquad (8)$$


monotonically reduces the potential Q(x) and yields a bounded sequence
of points $x_k$ because the level sets are bounded. Convergence of the
algorithm (8) is assured if the step sizes can be made large enough so that
the potential $Q(x_k)$ approaches zero rather than a positive limit. This
can be achieved by making $\gamma_k$ depend on the size of the gradient in such
a way that $\gamma_k$ cannot approach zero unless the gradient is approaching
zero. Goldstein® gives the following procedure for selecting $\gamma_k$. Define
the normalized potential drop:
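$$g(x, \gamma) = \frac{Q(x) - Q[x - \gamma \nabla Q(x)]}{\gamma \, \|\nabla Q(x)\|^{2}}, \qquad \gamma > 0, \qquad (9)$$

a continuous function of $\gamma$ which assumes all values between 1 and 0 as
$\gamma$ ranges between zero and some positive value. Then for any $\delta$ with

$$0 < \delta < \tfrac{1}{2}$$

choose $\gamma_k$ so that

$$\delta \le g(x_k, \gamma_k) \le 1 - \delta \qquad (10)$$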


if $g(x_k, 1) < \delta$; otherwise let $\gamma_k = 1$. Note that $\gamma_k$ can be chosen by trial
and error computation in each iteration. For small $\delta$ few trials are
necessary; but the resulting drop in potential in each iteration is smaller so
that more iterations are needed. With this method of choosing $\gamma_k$,
convergence of the algorithm (8) is assured for any starting point x.
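The iteration (8) with the step-size rule (9) and (10) can be sketched in Python as follows. This is a minimal illustration under stated assumptions and not a definitive implementation: the two-variable system, the nonlinearity $f_i(u) = u + u^3$, the matrix `A`, the vector `b`, and the choice of B as the identity are hypothetical, and the "trial and error" search for $\gamma_k$ is realised here by interval halving, which is only one possible choice.

```python
A = [[2.0, -1.0], [1.0, 3.0]]   # hypothetical matrix with positive principal minors
b = [5.0, -4.0]                  # chosen so that x = (1, -1) solves F(x) + Ax - b = 0


def residual(x):
    """r(x) = F(x) + Ax - b with the assumed nonlinearity f_i(u) = u + u**3."""
    return [x[i] + x[i] ** 3 + A[i][0] * x[0] + A[i][1] * x[1] - b[i]
            for i in range(2)]


def potential(x):
    """Q(x) = r^T B r with B taken as the identity."""
    return sum(ri * ri for ri in residual(x))


def gradient(x):
    """grad Q = 2 (D_x + A^T) r, where D_x = diag(f_i'(x_i)) and B = I."""
    r = residual(x)
    return [2.0 * ((1.0 + 3.0 * x[i] ** 2) * r[i] + A[0][i] * r[0] + A[1][i] * r[1])
            for i in range(2)]


def normalized_drop(x, grad, gamma):
    """g(x, gamma) of equation (9)."""
    trial = [x[i] - gamma * grad[i] for i in range(2)]
    grad_sq = sum(gi * gi for gi in grad)
    return (potential(x) - potential(trial)) / (gamma * grad_sq)


def choose_step(x, grad, delta, max_trials=60):
    """Take gamma = 1 if g(x, 1) >= delta; otherwise search for a gamma
    with delta <= g(x, gamma) <= 1 - delta by halving an interval."""
    if normalized_drop(x, grad, 1.0) >= delta:
        return 1.0
    low, high = 0.0, 1.0          # g tends to 1 as gamma -> 0, and g(x, 1) < delta
    gamma = 0.5
    for _ in range(max_trials):
        gamma = 0.5 * (low + high)
        g = normalized_drop(x, grad, gamma)
        if g < delta:
            high = gamma          # step too long: drop too small
        elif g > 1.0 - delta:
            low = gamma           # step too short
        else:
            break                 # condition (10) is satisfied
    return gamma


def solve(x0, delta=0.25, tol=1e-20, max_iter=10000):
    """Iterate equation (8) until the potential falls below tol."""
    x = list(x0)
    for _ in range(max_iter):
        if potential(x) <= tol:
            break
        grad = gradient(x)
        gamma = choose_step(x, grad, delta)
        x = [x[i] - gamma * grad[i] for i in range(2)]
    return x


solution = solve([0.0, 0.0])
print(solution)
```

For this example the iterates should approach the point (1, -1). The halving search terminates because g is continuous in $\gamma$ and the band between $\delta$ and $1 - \delta$ has positive width when $\delta < 1/2$.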

In summary, using the optimization approach and a result of Ref. 2 
we have shown the existence of a solution to equation (2) and the 
availability of a convergent algorithm to find the solution under the 
following conditions. 


(I) the nonlinearities $f_i(\cdot)$ are continuously differentiable, strictly
monotone increasing, and map the whole real line onto itself, and

(II) the matrix A is in the class $P_0$.

The original existence conditions given in Ref. 2 do not include the 
“continuously differentiable” assumption but are otherwise identical to 
conditions I and II above. 